UNIT-1 : DATA WAREHOUSING
Data preprocessing - Data cleaning, Data integration and transformation, Data reduction
Data Warehouse Design : Data warehouse schema, Partitioning strategy, Data warehouse implementation, Data marts
Meta data, Example of a multidimensional data model, Introduction to pattern warehousing

UNIT-2 : OLAP SYSTEMS
Basic concepts, OLAP queries, Types of OLAP servers, OLAP ...
Data warehouse hardware and operational design : Security, Backup and recovery

UNIT-3 : INTRODUCTION TO DATA & DATA MINING
Data types, Quality of data, Data preprocessing, Similarity measures, Summary statistics, Data distributions
Basic data mining tasks, Data mining v/s Knowledge discovery in databases, Issues in data mining
Introduction to fuzzy sets and fuzzy logic

UNIT-4 : SUPERVISED LEARNING
Classification : Statistical-based algorithms, Distance-based algorithms, Decision tree-based algorithms
Neural network-based algorithms, Rule-based algorithms, ...

UNIT-5 : CLUSTERING & ASSOCIATION RULE MINING
Hierarchical algorithms, Partitional algorithms, Clustering large databases : BIRCH, DBSCAN, CURE algorithms
Association rules : Parallel and distributed algorithms such as Apriori and FP growth algorithms

DATA WAREHOUSING - INTRODUCTION, DELIVERY PROCESS, DATA WAREHOUSE ARCHITECTURE

Q.1. Define data warehouse and write its characteristics. (R.G.P.V., Dec. 2008)

Ans. Data warehouses have been defined in many ways. Loosely speaking, a data warehouse is a database that is no more than a collection of the key pieces of information used to manage and direct the business for the most profitable result. A simple view of data warehouse is shown in fig. 1.1. The basic components of a data warehousing system are data migration, the warehouse and access tools.

Fig. 1.1 Data Warehouse

The term data warehouse was first used by William H. Inmon, a leading figure in the construction of data warehouse systems. He defined a data warehouse to be a set of data that is subject-oriented, integrated, time dependent and nonvolatile. This short, but comprehensive definition gives the main characteristics of a data warehouse. The four keywords time dependent, nonvolatile, integrated and subject-oriented differentiate data warehouses from other data repository systems, like file systems and relational database systems.

Let's take a closer look at each of these four keywords that differentiate data warehouses from other data repository systems.
Data Mining and Warehousing
(i) Time Dependent - This is one of the major aspects of the data warehouse as it relates to data mining. The storage of data is carried out to provide information from a historical perspective. In the data warehouse, every key structure has an element of time, either implicitly or explicitly.

(ii) Nonvolatile - A data warehouse is always a physically separate store of data, transformed from the application data found in the operational environment. Because of this separation, a data warehouse does not need transaction processing, concurrency control mechanisms and recovery.

(iii) Integrated - A data warehouse is organized by integrating multiple heterogeneous sources, like relational databases, flat files, and online transaction records. However, in a data warehouse, it is essential to integrate this data so as to make it consistent. Data cleaning and data integration are the techniques used to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

(iv) Subject-oriented - A data warehouse is constructed around main subjects such as supplier, sales and product. For decision makers, data warehouses give a simple and concise view of these subjects, leaving out data that serve no purpose in the decision support process.

In essence, a data warehouse is a consistent data store that is used as a physical implementation of a decision support data model and keeps the information that is used by an enterprise to make strategic decisions. A data warehouse is also viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured or ad hoc queries, decision making and analytical reporting.
On the basis of this information, data warehousing is viewed as a process of constructing and using data warehouses. The construction of a data warehouse needs data integration, data cleaning and data consolidation. Some authors use the term “data warehousing” to refer only to the process of data warehouse construction, whereas the term “warehouse DBMS” is used to refer to the management and utilization of data warehouses.

Q.2. Give the important example of data warehouse. (R.G.P.V., Nov. 2019)

Ans. A great example of data warehousing is what Facebook does. Facebook gathers all your data, such as your friends, your likes, your groups etc. All these data are stored in one central repository. Although Facebook stores all this information in separate databases, it stores the most relevant and significant information in one central aggregated database. There are many reasons for this, such as making sure that you see the most relevant ads that you are most likely to click on, or that the friends suggested to you are the most relevant to you.

Q.3. What are the architectural components of a data warehouse ? (R.G.P.V., Dec. 2002)
Or
Write in detail about the key components of data warehouse. (R.G.P.V., June 2013)

Ans. Various architectural components of a data warehouse are shown in fig. 1.2 and explained below -

Fig. 1.2

(i) Load Manager - The load manager is the system component that performs all the operations required to support the extract and load process. This system may be constructed using a combination of off-the-shelf tools, bespoke coding, C programs and shell scripts. The size and complexity of the load manager will vary between specific solutions, from data warehouse to data warehouse, but the larger the degree of overlap between the source systems, the larger the load manager will be. It is worth noting that third-party tools will probably contribute a maximum of 20-25% of the total system functionality.

Fig. 1.3 shows the architecture of load manager. It performs the following operations -
(a) Extract the data from the source systems.
(b) Fast-load the extracted data into a temporary data store.
(c) Perform simple transformations into a structure similar to the one in the data warehouse.
Fig. 1.3 Load Manager Architecture (components shown: Load Manager, Copy Management Tool)


(ii) Warehouse Manager - The warehouse manager is the system component that performs all the operations required to support the warehouse management process. This system is typically constructed using a combination of third-party systems management software, shell scripts, C programs and bespoke coding. The complexity of the warehouse manager is driven by the extent to which the operational management of the data warehouse has been automated. Third-party tools will probably contribute a maximum of 40% of the total system, with the bulk of the contribution being in systems management tools for backup/recovery/archiving.

Fig. 1.4 Warehouse Manager Architecture (components shown: Warehouse Manager, Controlling Process, Temporary Data Store, Stored Procedures, Backup/recovery Tool, Summary Tables)


Fig. 1.4 shows the architecture of a warehouse manager. It performs the following operations -
(a) Analyze the data to perform consistency and referential integrity checks.
(b) Transform and combine the source data in the temporary data store into the published data warehouse.
(c) Create indexes, partition views, business views and business synonyms against the base data.
(d) Generate denormalizations if appropriate.
(e) Create any new aggregations that may be required.
(f) Update all existing aggregations.
(g) Back up incrementally or totally the data within the data warehouse.
(h) Archive data that has reached the end of its capture life.


(iii) Query Manager - The query manager is the system component that performs all the operations required to support the query management process. This system is typically constructed using a combination of user access tools, specialist data warehousing monitoring tools, native database facilities, bespoke coding, shell scripts, and C programs. The complexity of the query manager will vary between specific solutions and is driven by the extent to which the facilities are provided by user access tools or native database facilities. Practically all of the query manager is built in later development phases. Typically, the query manager is designed in the first build phase, once the database and user access tool technologies have been determined.


Fig. 1.5 Query Manager Architecture (components shown: Query Manager, Query Redirection, Stored Procedures, Query Management Tool, Query Scheduling, Detailed Information)


Fig. 1.5 shows the architecture of a query manager. It performs the following operations -
(a) Direct queries to the appropriate tables.
(b) Schedule the execution of user queries.
In some cases, the query manager also stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
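The division of work between the load manager, the warehouse manager and the query manager can be pictured with a small sketch. This is only an illustration under stated assumptions, not the way any particular product is built: it uses an in-memory SQLite database, and the table names (staging_sales, fact_sales, summary_sales), the sample rows and the three function names are all hypothetical.

```python
import sqlite3


def load_manager(conn, source_rows):
    """Extract rows from a source and fast-load them into a temporary store."""
    conn.execute(
        "CREATE TABLE staging_sales (sale_date TEXT, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", source_rows)
    # Simple transformation: bring product names to one naming convention.
    conn.execute("UPDATE staging_sales SET product = UPPER(TRIM(product))")


def warehouse_manager(conn):
    """Check the staged data, publish it, index it and build an aggregation."""
    bad = conn.execute(
        "SELECT COUNT(*) FROM staging_sales WHERE amount IS NULL OR amount < 0"
    ).fetchone()[0]
    if bad:
        raise ValueError(f"{bad} staged rows failed the consistency check")
    conn.execute("CREATE TABLE fact_sales AS SELECT * FROM staging_sales")
    conn.execute("CREATE INDEX idx_fact_product ON fact_sales (product)")
    conn.execute(
        "CREATE TABLE summary_sales AS "
        "SELECT product, SUM(amount) AS total FROM fact_sales GROUP BY product"
    )


def query_manager(conn, product, need_detail=False):
    """Direct a query to the summary table unless detail rows are requested."""
    if need_detail:
        sql = (
            "SELECT sale_date, amount FROM fact_sales "
            "WHERE product = ? ORDER BY sale_date"
        )
    else:
        sql = "SELECT total FROM summary_sales WHERE product = ?"
    return conn.execute(sql, (product,)).fetchall()


# Hypothetical source data with inconsistent product spellings.
source_rows = [
    ("2024-01-05", " pen ", 10.0),
    ("2024-01-06", "Pen", 5.0),
    ("2024-01-06", "book", 40.0),
]
conn = sqlite3.connect(":memory:")
load_manager(conn, source_rows)
warehouse_manager(conn)
print(query_manager(conn, "PEN"))
print(query_manager(conn, "PEN", need_detail=True))
```

The query manager here answers a total from the small summary table and touches the detailed table only when asked to, which is the sense in which it directs queries to the appropriate tables.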


Q.4. What is the need for developing a data warehouse ?

Ans. There are several needs for developing a data warehouse, as follows -

(i) A data warehouse is a system that stores data from a company’s operational databases as well as external sources.

(ii) Data warehouse platforms are different from operational databases because they store historical information, which makes it easier for business leaders to analyze data over a specific period of time.

(iii) Data warehouses are developed for strategic decision support, and a warehouse is largely created from the existing databases that make up the operational database.

(iv) A data warehouse is developed for decision support queries, hence only the data that is needed for decision support is extracted from the operational data and stored in the warehouse.

(v) Developing a data warehouse needs specialist knowledge of data design, because the data model is made from the data required by users who want answers at high speed, and so the data design for the warehouse may be different from that of the operational database.

Q.5. Explain delivery process in data warehouse.

Ans. The delivery method is a variant of the joint application development approach, adopted for the delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss here does not reduce the overall delivery time-scales but ensures that the business benefits are delivered incrementally through the development process. The project is broken into phases to reduce the project and delivery risk. The stages in the delivery process are shown in fig. 1.6.

Fig. 1.6 Delivery Process
(i) IT Strategy - IT strategy is required to procure and retain funding for the project. Data warehouses are strategic investments that require a business process to generate benefits.

(ii) Education and Prototyping - Organizations experiment with the concept of data analysis and educate themselves on the value of having a data warehouse before settling for a solution. This is addressed by prototyping. It helps in understanding the feasibility and benefits of a data warehouse. The prototyping activity on a small scale can promote the educational process as long as -
(a) The prototype addresses a defined technical objective.
(b) The prototype can be thrown away after the feasibility concept has been shown.
(c) The activity addresses a small subset of the eventual data content of the data warehouse.
(d) The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and deliver business benefits.
(a) Identify the architecture that is capable of evolving.
(b) Focus on business requirements and technical blueprint phases.
(c) Limit the scope of the first build phase to the minimum that delivers business benefits.
(d) Understand the short-term and medium-term requirements of the data warehouse.

(iii) Business Case Analysis - In this phase, the objective of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data warehouse does not have a clear business case, then the business tends to suffer from credibility problems at some stage during the delivery process. Therefore in data warehouse projects, we need to understand the business case for investment.


(iv) Business Requirements -
(a) By understanding the business requirements for both the short and medium term, we can design a solution that satisfies the short-term need, but is capable of growing to the full solution.
(b) At least 20% of the time should be spent on understanding the likely longer-term requirement.
(c) Within this stage, we must determine -
(1) The logical model for information within the data warehouse.
(2) The source systems that provide this data (that is, ...).
(3) The business rules to be applied to data.
(4) The query profiles for the immediate requirement.

(v) Technical Blueprint - This phase needs to deliver an overall architecture satisfying the long-term requirements. This phase also delivers the components that must be implemented on a short-term basis to derive any business benefit. A blueprint identifies the following -
(a) The overall system architecture.
(b) The backup and recovery strategy.
(c) The data retention policy.
(d) The capacity plan for hardware and infrastructure.
(e) The server and data mart architecture.
(f) The components of database design.

(vi) Build the Vision - In this phase, the first production deliverable is produced. This production deliverable is the smallest component of a data warehouse. This smallest component adds business benefit.

(vii) History Load - This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase, we do not add any new entities, but additional physical tables would probably be created to store increased data volumes. The backup and recovery procedures may become complex, therefore it is recommended to perform this activity within a separate phase.

(viii) Ad hoc Query - In this phase, we configure an ad hoc query tool that is used to operate a data warehouse. These tools can generate the database query. It is recommended not to use these access tools when the database is being substantially modified.

(ix) Automation - In this phase, operational management processes are fully automated. These would include -
(a) Monitoring query profiles and determining appropriate aggregations to maintain system performance.
(b) Generating aggregations from predefined definitions within the data warehouse.
(c) Extracting and loading data from different source systems.
(d) Transforming the data into a form suitable for analysis.
(e) Backing up, restoring, and archiving the data.
(x) Requirements Evolution - From the perspective of the delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system.
This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, where the new requirements are continually fed into the development activities and the partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.

(xi) Extending Scope - In this phase, the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways -
(a) By loading additional data into the data warehouse.
(b) By introducing new data marts using the existing information.


Q.6. Discuss the architecture of data warehouse. (R.G.P.V., June 2008)
Or
Discuss three-tier architecture for data warehouse. (R.G.P.V., Dec. 2008)
Or
Briefly describe 3-tier data warehouse architecture. (R.G.P.V., June 2011)
Or
Describe the overall and typical architecture of data warehouse. (R.G.P.V., June 2013)
Or
Describe the term data warehouse architecture. (R.G.P.V., Dec. 2010, June 2014)
Or
With a neat sketch explain the architecture of a data warehouse. (R.G.P.V., May 2019)
Or
Write a short note on data warehouse architecture. (R.G.P.V., Nov. 2019)

Ans. Data warehouse uses a three-tier architecture, as shown in fig. 1.7.

Fig. 1.7 A Three-tier Data Warehousing Architecture

(i) A warehouse database server is the bottom tier, which is mainly a relational database system. To supply data from operational databases or other external sources to the bottom tier, back-end tools and utilities are used. These tools and utilities carry out data extraction, data cleaning, data transformation, and load and refresh functions to keep the data warehouse up to date. Application program interfaces called gateways are used to extract data. A gateway is supported by the underlying DBMS and permits client programs to generate SQL code to be executed at a server. ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity) are examples of gateways. The bottom tier has a metadata repository that keeps information about the data warehouse and its contents.

(ii) An OLAP server is the middle tier, which is implemented using either a multidimensional OLAP (MOLAP) model or a relational OLAP (ROLAP) model. A MOLAP model is a special server that directly implements multidimensional data and operations, while a ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations.

(iii) Front-end client layer is the top tier, which maintains analysis tools, query and reporting tools and data mining tools.
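The statement that a ROLAP server maps multidimensional operations to standard relational operations can be made concrete with a short sketch. It is a minimal illustration under assumptions: the bottom tier is stood in for by an in-memory SQLite table, and the table name `sales`, its columns, the sample rows and the helper `roll_up` are hypothetical. The roll-up over chosen dimensions is expressed as an ordinary `GROUP BY` query.

```python
import sqlite3

# Stand-in for the bottom-tier warehouse server (assumed schema, sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, quarter TEXT, item TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "Q1", "pen", 10.0),
        ("North", "Q2", "pen", 20.0),
        ("South", "Q1", "pen", 5.0),
        ("South", "Q1", "book", 30.0),
    ],
)

ALLOWED_DIMENSIONS = ("region", "quarter", "item")


def roll_up(conn, dimensions):
    """Map a multidimensional roll-up onto a relational GROUP BY query."""
    if not dimensions or any(d not in ALLOWED_DIMENSIONS for d in dimensions):
        raise ValueError("unknown or missing dimension")
    cols = ", ".join(dimensions)  # safe: names are checked against a whitelist
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


by_region = roll_up(conn, ["region"])
by_region_item = roll_up(conn, ["region", "item"])
print(by_region)
print(by_region_item)
```

A MOLAP server would instead hold the aggregated cells directly in a multidimensional array structure, so no translation to SQL would be needed.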
Q.7. What is data warehouse ? Explain the data warehouse architecture with diagram. (R.G.P.V., May 2018)
Or
What is data warehouse ? Discuss a three-tier data warehouse architecture. (R.G.P.V., Dec. 2003, 2009)
Or
What is data warehouse ? Discuss a three-tier data warehouse. (R.G.P.V., June 2014)

Ans. Refer to Q.1 and Q.6.

Q.8. Draw the data warehouse architecture and explain its components. (R.G.P.V., June 2016)

Ans. Refer to Q.6 and Q.3.


Q.9. Differentiate between two and three tier architectures of data warehousing. (R.G.P.V., Dec. 2004)

Ans. The two-tiered data warehouse is a “fat” client model. The top tier is a client containing functions such as user interface, data access, report formatting, query specification, data analysis, and aggregation. The bottom tier is a warehouse server, which performs data logic, file services and data services, and maintains metadata.

The two-tiered architecture lacks the scalability and flexibility of the three-tiered model. The three-tiered architecture solves the scalability and flexibility issues of the two-tiered data warehouse. In the three-tiered architecture, the middle layer (i.e., application servers) performs data filtering, aggregation, and data access, supports metadata, and provides multidimensional views. Alternatively, application servers can be data mart servers, with all the benefits of a dependent data mart environment. The top and bottom layers have the same functions as in the two-tiered architecture.


Q.10. Discuss system development life cycle of a data warehouse. What factors should be considered while designing a data warehouse ? (R.G.P.V., June 2015)

Ans. The data warehouse development life cycle is approached just like any software development life cycle. The various phases of the system development life cycle of a data warehouse are as follows -

(i) Planning - This is the first phase of a data warehouse project. Planning describes the goals, objectives, project schedule and various tasks which can help in the progress of the project.

(ii) Requirements - This phase specifies the basic requirements needed to carry out the project. These can be clearly understood by interacting with the customer to gather requirements.

(iii) Analysis - This phase deals with the development of logical data warehouse and data mart models. It also defines the processes which are needed to connect the data warehouse, data sources and tools together.

(iv) Design - This phase deals with the detailed design of the project.

(v) Construction - In this phase, modification is done after testing those processes which were developed in the design phase. This phase validates the data extraction and transformation functions and also confirms data quality.

(vi) Deployment - This phase finally deploys the data warehouse after it has gone through all the phases.

(vii) Maintenance - This is the most important phase of the data warehouse life cycle. The warehouse needs to be maintained from time to time so as to keep the performance up to the mark.

The main factors that need to be considered while designing a data warehouse include heterogeneity of data sources, use of historical data, increasing size of databases and the business driven nature of the data warehouse. Besides, some other factors which should be considered in designing a data warehouse are data content, metadata, data distribution, tools and performance.

Q.11. What are the differences between the three main types of data warehouse usage - information processing, analytical processing and data mining ? (R.G.P.V., June 2009, 2017)

Ans. There are three types of data warehouse applications, as follows -

(i) Information Processing - It supports basic statistical analysis, querying, and reporting using crosstabs, charts, tables or graphs. In data warehouse information processing, a current trend is to construct low-cost Web-based accessing tools that are then integrated with Web browsers.

(ii) Analytical Processing - It supports basic OLAP operations - drill-up, drill-down, slice, dice and pivoting. Generally, it works on historical data in both summarized and detailed forms. The major strength of OLAP over information processing is the multidimensional data analysis of data warehouse data.

(iii) Data Mining - It supports knowledge discovery by performing classification and prediction, discovering hidden patterns and associations, constructing analytical models, and presenting the mining results using visualization tools.
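The OLAP operations named under analytical processing can be tried out on a tiny in-memory cube. The sketch below is only an illustration under assumptions: the cube is a plain dictionary keyed by (region, quarter, item), and the sample figures and the function names `drill_up`, `slice_cube`, `dice` and `pivot` are hypothetical. Drill-down is simply the reverse of drill-up, that is, going back from the summarized result to the more detailed cells.

```python
from collections import defaultdict

# Each cell is keyed by (region, quarter, item) and holds a sales amount.
DIMS = ("region", "quarter", "item")
cube = {
    ("North", "Q1", "pen"): 10,
    ("North", "Q2", "pen"): 20,
    ("South", "Q1", "pen"): 5,
    ("South", "Q1", "book"): 30,
}


def drill_up(cube, keep):
    """Summarize the cube, keeping only the dimensions listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, amount in cube.items():
        totals[tuple(key[i] for i in idx)] += amount
    return dict(totals)


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Select a sub-cube by restricting several dimensions to sets of values."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def pivot(cube, order):
    """Rotate the axes by reordering the dimensions in every cell key."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}


print(drill_up(cube, ["region"]))
print(slice_cube(cube, "quarter", "Q2"))
print(dice(cube, region={"South"}, item={"pen", "book"}))
print(pivot(cube, ["item", "region", "quarter"]))
```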
Q.12. “A data warehouse is built on historical data and is not guaranteed to be up-to-date information”. Comment. (R.G.P.V., Dec. 2004)

Ans. The data warehouse is designed for strategic decision support, and is largely built up from the databases that make up the operational database. A basic aspect of a data warehouse is that it holds a large amount of data, which can mean billions of records. The basic structure of a data warehouse is governed by some specific rules. One of them is that it is nonvolatile. According to this, a data warehouse is always a physically separate store of data, transformed from the application data found in the operational data environment. Due to this separation, a data warehouse does not need transaction processing, recovery and concurrency control mechanisms. It needs only two operations in data accessing: initial loading of data and access of data. This means that data in a data warehouse is never updated but is used only for queries. Such data can only be loaded from other databases, like the operational database. End users who need to update data must use the operational database, as only the latter can be updated, changed or deleted. This means that a data warehouse will always be filled with historical data. Hence we can say that “A data warehouse is built on historical data and is not guaranteed to be up-to-date information”.

Q.13. Give three examples of problems likely to be encountered when operational data are integrated into the data warehouse. (R.G.P.V., Dec. 2011)
Or
Give reason, why it is necessary to separate data warehouse from operational database. (R.G.P.V., June 2016)

Ans. The integration of operational data into the data warehouse can encounter three types of problems, such as -

(i) Inconsistently Encoded Data - Data that is not encoded consistently creates the problem of lack of integration. One simple example is the encoding of gender. In one application, gender is encoded as 0/1. In another, it is encoded as m/f. In yet another, it is encoded as a/b. Of course, in the data warehouse, it doesn’t matter how gender is encoded as long as it is done consistently.

(ii) Semantic Field Transformation - Another problem which is encountered during integration is semantic field transformation. Say that the same field is used in five applications under five different names. A mapping from the various source fields to the data warehouse fields must occur to transform the data to the data warehouse appropriately.

(iii) Legacy Data - Legacy data is held in several different formats under several different DBMSs. Some legacy data is under one DBMS, some is under IMS, and still other legacy data is under yet another. All of these technologies must have the data they protect brought forward into a single technology in order to populate the data warehouse. This type of translation of technology is not always straightforward.
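The first two problems, inconsistent encoding and differing field names, are usually handled by a per-source mapping applied while loading. The sketch below is an illustration under assumptions: the source names (app1, app2, app3), the source field names, and the choice of which code stands for which gender are all hypothetical, since the text only says that the encodings 0/1, m/f and a/b occur.

```python
# Hypothetical per-source rules: which source field holds gender, and how each
# source code maps to the single encoding chosen for the warehouse ("M"/"F").
SOURCE_RULES = {
    "app1": {"field": "sex", "codes": {0: "M", 1: "F"}},
    "app2": {"field": "gender", "codes": {"m": "M", "f": "F"}},
    "app3": {"field": "gndr", "codes": {"a": "M", "b": "F"}},
}


def to_warehouse(source, record):
    """Rename the source field and recode its value for the warehouse."""
    rule = SOURCE_RULES[source]
    raw = record[rule["field"]]
    if raw not in rule["codes"]:
        raise ValueError(f"unexpected gender code {raw!r} from {source}")
    return {"customer_id": record["id"], "gender": rule["codes"][raw]}


rows = [
    to_warehouse("app1", {"id": 1, "sex": 0}),
    to_warehouse("app2", {"id": 2, "gender": "f"}),
    to_warehouse("app3", {"id": 3, "gndr": "b"}),
]
print(rows)
```

Rejecting unknown codes instead of passing them through keeps inconsistently encoded values from reaching the warehouse unnoticed.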

Q.14. ... the advanced database systems and also write the ... database system with example. (R.G.P.V., June 201...)

Ans. The advanced database systems can be categorised into different types -

(i) Spatial Databases and Spatiotemporal Databases - Spatial databases contain space-related information. Geographic databases, medical and satellite image databases, and VLSI or computer-aided design databases are some of the examples of spatial databases. Spatial data may be represented in raster format, made up of pixel maps or bit maps. Maps may also be represented in vector format, where buildings, roads, bridges, and lakes are represented as unions or overlays of fundamental geometric constructs such as polygons, points, lines, and the partitions and networks formed by these components.

A spatiotemporal database is formed by a spatial database that stores spatial objects that change with time. Interesting information can be mined from a spatiotemporal database.

(ii) Data Stream Management Systems - Several applications involve the analysis and generation of a new kind of data, known as stream data, where data flow in and out of an observation platform dynamically. Typical features of data streams include flowing in and out in a fixed order, permitting only one or a small number of scans, huge or possibly infinite volume, dynamically changing, and demanding fast response time. Web click streams, network traffic, stock exchange, and telecommunications are some of the examples of data streams.

(iii) Object-relational Databases - Object-relational databases are formed on the basis of an object-relational data model. This model extends the relational model by giving a rich data type for handling object orientation and complex objects. In applications and industry, object-relational databases are becoming very popular because most complicated database applications need to handle complex structures and objects.

(iv) Heterogeneous Databases and Legacy Databases - A heterogeneous database is made up of a set of autonomous, interconnected component databases. The components communicate in order to exchange information and answer queries. Objects in one component database may differ greatly from objects in other component databases, making it hard to assimilate their semantics into the overall heterogeneous database.

A group of heterogeneous databases forms a legacy database. In a legacy database, different kinds of data systems are combined, such as network databases, multimedia databases, hierarchical databases and object oriented or relational databases.
(v) Web based Global Information Systems - The World Wide Web and its associated distributed information services provide worldwide, rich, on-line information services, where interactive access is facilitated by linking data objects together.

(vi) Text Databases and Multimedia Databases - Databases which contain word descriptions for objects are known as text databases. These word descriptions are long sentences or paragraphs, such as summary reports, warning messages, error or bug reports, product specifications, or other documents. Text databases may be categorized into highly unstructured, semistructured, and well structured databases.

Multimedia databases store image, audio, and video data. Typical applications of multimedia databases include video-on-demand systems, the World Wide Web, voice-mail systems, picture content-based retrieval and speech based user interfaces that identify spoken commands. Large objects must be supported by multimedia databases, because data objects like video may need gigabytes of storage.

(vii) Temporal Databases, Sequence Databases and Time-series 
Databases — Data that contain time-related attributes are stored by a temporal
database. These attributes may include several timestamps, each having
different semantics.

The order in which events occurred is stored by a sequence database. A
sequence database stores this ordering with or without a concrete notion of
time. Biological sequences, Web click streams, and customer shopping
sequences are some examples of sequence data.

Sequences of values or events obtained from repeated measurements over
time are stored by a time-series database. Typical examples of time-series
data include inventory control data, observations of natural phenomena, and
data gathered from the stock exchange.

Q.15. Discuss in brief various types of data warehouse.
(R.G.P.V., Dec. 2002)
Or
Compare enterprise warehouse, data mart and virtual warehouse.
(R.G.P.V., June 2010)
Or
Describe three data warehouse models: the enterprise warehouse, the
data mart and the virtual warehouse. (R.G.P.V., June 2015)
Or
Briefly explain the data warehouse models with an example.
(R.G.P.V., Nov. 2019)
Ans. There are three types of data warehouse models. They are as follows —



(i) Data Mart — Data marts are partitions of the overall data
warehouse. Data marts come in all shapes and sizes, and may be created
for a number of reasons. Data marts improve query performance, simply
by reducing the volume of data that needs to be scanned to satisfy a query. A
data mart has a subset of corporate-wide data that is valuable to a particular
group of users. The scope is limited to specific chosen topics. For example, a
marketing data mart may limit its subjects to customer, item, and sales. In data
marts, the data tend to be summarized. Data marts are generally implemented
on low-cost departmental servers that are UNIX/LINUX or Windows-based.
The data mart implementation cycle is typically measured in weeks.


(ii) Enterprise Warehouse — An enterprise warehouse is cross-functional
in scope and collects all of the information about subjects spanning the whole
organization. An enterprise data warehouse may be implemented on traditional
mainframes, computer superservers, or parallel architecture platforms. It offers
corporate-wide data integration, usually from one or more operational systems or
external information providers. This model contains both detailed data and
summarized data, and can range in size from a few gigabytes to terabytes. It needs
extensive business modeling and may take a long time to design and build.


(iii) Virtual Warehouse — A virtual warehouse is a virtual view of
databases, allowing the creation of a “virtual warehouse” as opposed to a
physical warehouse. In a virtual warehouse, you have a logical description of
all the databases and their structures, and individuals who require information
from those databases do not have to know anything about them. This method
creates a single “virtual database” from all the data resources. The data resources
can be local or remote. In this type of data warehouse, the data is not moved
from the sources. Instead, the users access the data directly. This direct access
to the data is sometimes done by simple SQL queries, view definitions, or data-
access middleware. With this approach, it is possible to access remote data
sources including major RDBMSs. The virtual data warehouse scheme lets a
client application access data distributed across multiple data sources through
a single SQL statement and a single interface. All data sources are accessed as if
they are local; users and their applications do not even need to know the physical
location of the data.


There is a great benefit in starting with a virtual warehouse, since many
organizations do not want to replicate information in the physical data warehouse.
Some organizations decide to provide both by creating a data warehouse
containing summary-level data with access to legacy data for transaction details.
A virtual data warehouse is easy to build but requires excess capacity on
operational database servers. Its performance can be considerably degraded
because the queries must compete with the production data transactions. Since
there is no metadata, no summary data or history, all the queries must be repeated,
creating an additional burden on the system. Above all, there is no cleansing and
refreshing process involved, causing the queries to become very complex.
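
The idea of reaching source data through a view definition, without moving the
data, can be illustrated with a small sketch. The table names store_orders and
web_orders and the view name all_sales are hypothetical, and a single in-memory
SQLite database stands in for what would normally be several operational
systems. It is only an illustration of the principle, not of any particular
product.

```python
import sqlite3

def build_virtual_warehouse():
    """Create two stand-in 'operational' tables and a view over them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE store_orders (order_id INTEGER, item TEXT, amount REAL);
        CREATE TABLE web_orders   (order_id INTEGER, item TEXT, amount REAL);
        INSERT INTO store_orders VALUES (1, 'tv', 400.0), (2, 'radio', 50.0);
        INSERT INTO web_orders   VALUES (10, 'tv', 380.0);

        -- The "virtual warehouse": no data is copied, the view only
        -- describes how to reach the source tables.
        CREATE VIEW all_sales AS
            SELECT 'store' AS channel, item, amount FROM store_orders
            UNION ALL
            SELECT 'web' AS channel, item, amount FROM web_orders;
        """
    )
    return conn

conn = build_virtual_warehouse()

# The user issues one SQL statement against one interface (the view).
totals = conn.execute(
    "SELECT item, SUM(amount) FROM all_sales GROUP BY item ORDER BY item"
).fetchall()

# A new operational transaction is visible immediately, because the view
# reads the source table every time it is queried.
conn.execute("INSERT INTO web_orders VALUES (11, 'radio', 45.0)")
totals_after = conn.execute(
    "SELECT item, SUM(amount) FROM all_sales GROUP BY item ORDER BY item"
).fetchall()
```

Because every query is evaluated against the source tables, the sketch also
shows the drawback noted above: the summary has to be recomputed on each
request, and that work competes with the operational workload.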


Q.16. Explain the concept of data warehouse and data mart with an
example. (R.G.P.V., June 2012)
Ans. Refer to Q.1 and Q.15 (i).


Q.17. Data warehouses use some tools. What are their functions ?
Or
Explain the different tools required to manage a data warehouse.
(R.G.P.V., May 2018)
Ans. The back-end tools and utilities are used by data warehouse systems 
to populate and refresh their data. The functions of these tools and utilities are 
as follows — 
(i) Data Extraction — Collects data from multiple, heterogeneous, 
and external sources. 
(ii) Data Cleaning — Finds errors in the data and corrects
them where possible.
(iii) Data Transformation — Transforms the data from legacy
or host format to warehouse format.
(iv) Load — Sorts, summarizes, consolidates, checks integrity,
computes views, and builds indices and partitions.
(v) Refresh — Propagates the updates from the data sources to
the warehouse.
In addition to cleaning, loading, refreshing, and metadata definition tools,
data warehouse systems usually offer a good set of data warehouse management
tools. Data cleaning and data transformation are the two important steps for
improving the quality of the data.
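
The back-end functions listed above can be sketched as a chain of small Python
functions. This is only a minimal illustration under stated assumptions: the
record layout (dictionaries with item and amount keys), the cleaning rule (drop
rows with a missing amount) and the use of a plain dictionary as the
"warehouse" are all hypothetical choices made for the example.

```python
def extract(sources):
    """Data extraction: collect rows from several (heterogeneous) sources."""
    rows = []
    for source in sources:
        rows.extend(source)
    return rows

def clean(rows):
    """Data cleaning: drop rows with a missing amount, tidy up item names."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:
            continue  # assumption: rows without an amount cannot be repaired
        cleaned.append({"item": row["item"].strip().lower(),
                        "amount": row["amount"]})
    return cleaned

def transform(rows):
    """Data transformation: convert source formats to the warehouse format."""
    return [{"item": r["item"], "amount": float(r["amount"])} for r in rows]

def load(rows, warehouse):
    """Load: sort and summarize the rows into per-item totals."""
    for r in sorted(rows, key=lambda r: r["item"]):
        warehouse[r["item"]] = warehouse.get(r["item"], 0.0) + r["amount"]
    return warehouse

def refresh(warehouse, new_source_rows):
    """Refresh: propagate later source updates into the warehouse."""
    return load(transform(clean(new_source_rows)), warehouse)

source_a = [{"item": " TV ", "amount": "400"}, {"item": "radio", "amount": None}]
source_b = [{"item": "tv", "amount": 380}]

warehouse = load(transform(clean(extract([source_a, source_b]))), {})
warehouse = refresh(warehouse, [{"item": "Radio", "amount": "50"}])
```

After the initial load the warehouse holds one total for tv, and the refresh
step adds the radio sale that arrived later.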


Q.18. Discuss data warehouse functions and explain the data flow within
data warehouse. (R.G.P.V., Dec. 2002)

Ans. The processes shown in fig. 1.8 correspond to the data flows
within a data warehouse.

Fig. 1.8 Process Flow Within a Data Warehouse

(i) Extract and Load Data — Data extraction takes data from source
systems and makes it available to the data warehouse. Data load takes extracted
data and loads it into the data warehouse. In operational systems, data is held in
a form that is suited to that system. The original information content will have
been modified over the years to support the data/performance requirements of
the operational systems. When we extract data from the physical database, and
before loading the data into the data warehouse, this information content must
be reconstructed.

In essence, information can be defined as data with context and meaning.
The data warehouse extract and load process must take data and add context
and meaning, in order to transform it into value-adding business information.
Generally, this is obtained by extracting the data from the source systems,
loading it into the database, stripping out any detail that is there to support
the operational system rather than the business requirement, adding more
context, and then reconciling the data with other sources.

(a) Controlling the Process — This is the mechanism that
determines when to start extracting the data, run the transformations and
consistency checks, and so on.

(b) When to Initiate the Extract — Source data should be
extracted when it represents the same point in time as the extracts from
the other data sources.

(c) Data Loading — After the data is extracted from the source
systems, it is loaded into a temporary data store so that it can be cleaned up
and made consistent.

(d) Copy Management Tools and Data Cleanup — When the
source systems do not overlap much, and the consistency checks are simple,
a copy management tool will cut down the coding effort required. If this is not
the case, a copy management tool may not add sufficient value to justify its
purchase.


(ii) Clean and Transform Data — This is the system process that
takes the loaded data and structures it for query performance, and for minimizing
operational costs. There are a small number of steps within the process —

(a) Clean and transform the loaded data into a structure that
speeds up queries.
(b) Partition the data in order to optimize hardware performance,
speed up queries, and simplify the management of the data warehouse.
(c) Generate aggregations to speed up the common queries.


(iii) Backup and Archive Process — Similar to operational systems, 
the data within the data warehouse is backed up regularly to guarantee that the 
data warehouse can always be recovered from data loss, software failure or 
hardware failure. 


(iv) Query Management Process — The query management process 
is the system process that manages the queries and speeds them up by directing 
queries to the most effective data source. This process must also ensure that 
all the system resources are used in the most effective way, usually by scheduling 
the execution of queries. The query management process may also be needed 
to monitor the actual query profile. Then, this information would be used by 
the warehouse management process to determine which aggregations to create. 
Generally, query management does not operate during the regular load of
information into the data warehouse. It operates at all times that
the data warehouse is made available to end users. The facilities
that are constantly in operation are as follows —


(a) Directing Queries — Data warehouses that contain summary
data potentially provide a number of distinct data sources to respond to a
specific query. These are the detailed information itself, and any number of
aggregations that satisfy the query’s information requirement.

(b) Maximizing System Resources — Regardless of the
processing power available to run the data warehouse, it is all too possible that
a single large query can soak up all system resources, affecting the performance
of the whole system.

(c) Query Capture — As users become used to the facilities
provided by a data warehouse, they will change the types of queries they ask.
This is unavoidable and should be encouraged, since it shows that users are
exploiting the information content of the data warehouse.
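
The idea of directing a query to the most effective data source can be sketched
as follows. The source descriptions (names, dimension sets and row counts) are
hypothetical, and the rule "pick the smallest source that still contains every
attribute the query needs" is only one simple assumption about what "most
effective" might mean.

```python
def direct_query(required_dims, sources):
    """Return the smallest data source that can answer the query.

    required_dims: set of attributes the query groups or filters by.
    sources: list of dicts with hypothetical keys 'name', 'dims' and 'rows'.
    """
    candidates = [s for s in sources if set(required_dims) <= s["dims"]]
    if not candidates:
        raise LookupError("no data source can answer this query")
    # Fewer rows to scan is taken here as a proxy for a faster answer.
    return min(candidates, key=lambda s: s["rows"])

SOURCES = [
    {"name": "sales_detail",
     "dims": {"day", "month", "year", "item", "branch"}, "rows": 5_000_000},
    {"name": "sales_by_month_item",
     "dims": {"month", "year", "item"}, "rows": 40_000},
    {"name": "sales_by_year", "dims": {"year"}, "rows": 10},
]

chosen = direct_query({"year", "item"}, SOURCES)["name"]
```

A query by year and item is sent to the monthly aggregation, a query by year
alone to the yearly aggregation, and only a query that needs branch-level
detail falls back to the detailed data.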


Q.19. What are steps involved in clean and transformation of data ?
(R.G.P.V., June 2016)
Ans. Refer to Q.18 (ii).


Q.20. Describe the query management process. (R.G.P.V., June 2013)
Ans. Refer to Q.18 (iv).

DATA PREPROCESSING — DATA CLEANING, DATA INTEGRATION AND
TRANSFORMATION, DATA REDUCTION

Q.21. Discuss in brief, why do we preprocess data in data mining ?
(R.G.P.V., June 2009)
Ans. A technique that is used to enhance the data quality before applying
mining, with the aim that the data will lead to high-quality mining results, is
known as data preprocessing. Data preprocessing techniques can significantly
enhance the overall quality of the patterns mined and the time needed for the
actual mining. Data preprocessing includes data cleaning, data integration, data
transformation, and data reduction. Data cleaning can be performed to eliminate
noise and correct inconsistencies in the data. Data integration combines data
from multiple sources into a coherent data store, such as a data warehouse.
Data transformation operations convert the data into suitable forms
for mining. Data reduction can reduce the data size by aggregating, eliminating
redundant features, or clustering to obtain a reduced representation of the data
that is much smaller in volume, yet generates the same analytical results.
These techniques are not mutually exclusive. They may work together.

Q.22. Why preprocessing of data is needed ?
Or
Evaluate the need for data preprocessing by taking example of an
application of your choice. (R.G.P.V., June 2017)
Ans. Incomplete, inconsistent, and noisy data are usual properties of large
real-world databases and data warehouses. There are several reasons due to
which incomplete data can take place. Some attributes may not always be
present, like customer information for sales transaction data. Other data may
not be included since it was not considered important at the time of entry.
Relevant data may not be stored because of misunderstanding or equipment
malfunctions. Data that were inconsistent with other recorded data may have
been deleted. In addition, the recording of the history or modifications to the
data may have been overlooked. Missing data may need to be deduced,
especially for tuples with missing values for some attributes.

Noisy data can take place for several reasons. The data collection instruments
used may be faulty. There may have been human or computer error occurring at
the time of data entry. Errors in data transmission can also occur. There may
be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption. Incorrect data may also result
from inconsistencies in naming conventions or data codes used, or inconsistent
formats for input fields, like date.

Low-quality data will lead to low-quality mining results. Therefore, data
preprocessing is needed to enhance the data quality and, consequently, the
mining results.



Q.23. Describe various methods for data preprocessing.
(R.G.P.V., Nov./Dec. 2007)
Or
Discuss briefly the various data preprocessing technique.
(R.G.P.V., June 2008)
Ans. The various methods for data preprocessing are as follows —


(i) Data Cleaning — Data cleaning routines attempt to fill in missing
values, smooth noisy data, identify or eliminate outliers, and correct
inconsistencies in the data. In case of dirty data, the users will not trust the
results of any data mining that has been applied to it. Also, the data that are
dirty can create confusion for the mining procedure, thus providing unreliable
output. Although most mining routines adopt some procedures to cope with
incomplete or noisy data, they are not always robust.


(ii) Data Integration — Data integration is a useful preprocessing
step in which we integrate data from multiple databases, data cubes, or files
into a coherent data store for our analysis. However, some attributes may
have different names in different databases, resulting in inconsistencies and
redundancies. The attribute for product name may be referred to as prod_name
in one data store and Pname in another, for example. A large amount
of redundant data may slow down the knowledge discovery process or create
confusion.

(iii) Data Transformation — Data transformation routines are
additional data preprocessing procedures. Data transformation operations are
applied to transform the data into appropriate forms for mining. For example,
attribute data may be normalized so as to fall within a small range, such as
0.0 to 1.0.


(iv) Data Reduction — Data reduction techniques are used to obtain
a reduced representation of the data set while minimizing the loss of information
content. There are a number of techniques for data reduction, namely, data
aggregation (building a data cube), numerosity reduction (replacing the data by
alternative, smaller representations), dimensionality reduction (using encoding
schemes), and attribute subset selection (removing irrelevant attributes). Data
can also be reduced by generalization with the use of concept hierarchies. For
example, low-level concepts, such as the date of a sale, are replaced with
higher-level concepts, such as week, month, or quarter. Fig. 1.9 shows the
different forms of data preprocessing.

Fig. 1.9 Forms of Data Preprocessing

Q.24. Why preprocessing of data is required ? What are the various
techniques ? (R.G.P.V., Dec. 2010, June 2014)
Or
What is the need of data preprocessing ? Discuss various forms of
preprocessing. (R.G.P.V., Nov. 2019)
Ans. Need of Data Preprocessing — Refer to Q.22.
Forms of Data Preprocessing — Refer to Q.23.


Q.25. In real world data, tuples with missing values for some attributes
are a common occurrence. Describe various methods for handling this
problem.

Ans. Suppose that AllElectronics sales and customer data need to be
analyzed. Many tuples have no recorded value for several attributes like
customer income. How can we go about filling in the missing values for this
attribute ? The following procedures can be used —

(i) Ignore the Tuple — This method is usually applied when the class
label is missing. It is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.

(ii) Fill in the Missing Value Manually — Generally, this method is
time consuming and may not be feasible for a large data set with
many missing values.

(iii) Use a Global Constant to Fill in the Missing Value — Replace
every missing attribute value by the same constant, such as a label like
“Unknown” or −∞. When missing values are replaced by, say, “Unknown”,
the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common, that of “Unknown”. Thus,
although this method is easy, it is not foolproof.


(iv) Use the Attribute Mean to Fill in the Missing Value — For
example, suppose that the average income of AllElectronics customers is
10,000. This value is used to replace the missing value for income.


(v) Use the Attribute Mean for All Samples Related to the Same 
Class as the Given Tuple — For example, when classifying customers according 
to credit_risk, replace the missing value with the average income value for 
customers in the same credit risk category as that of the given tuple. 


(vi) Use the Most Probable Value to Fill in the Missing Value — 
This may be determined with regression, inference-based tools using a Bayesian 
formalism, or decision tree induction. For example, a decision tree is created 
to predict the missing values for income using the other customer attributes in 
dataset. 
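
Procedures (iii), (iv) and (v) above can be sketched in a few lines of Python.
The customer records, the attribute names income and credit_risk, and the use
of None to mark a missing value are assumptions made for this illustration.

```python
from statistics import fmean

def fill_with_constant(rows, attr, constant="Unknown"):
    """Procedure (iii): replace every missing value by one global constant."""
    return [dict(r, **{attr: constant}) if r[attr] is None else dict(r)
            for r in rows]

def fill_with_mean(rows, attr):
    """Procedure (iv): replace missing values by the attribute mean."""
    mean = fmean([r[attr] for r in rows if r[attr] is not None])
    return [dict(r, **{attr: mean}) if r[attr] is None else dict(r)
            for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Procedure (v): use the mean of the tuples in the same class.

    Raises KeyError if a class has no known value at all for attr.
    """
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: fmean(vals) for c, vals in known.items()}
    return [dict(r, **{attr: class_means[r[class_attr]]})
            if r[attr] is None else dict(r)
            for r in rows]

customers = [
    {"credit_risk": "low", "income": 12000},
    {"credit_risk": "low", "income": 10000},
    {"credit_risk": "high", "income": 8000},
    {"credit_risk": "low", "income": None},
    {"credit_risk": "high", "income": None},
]

by_constant = fill_with_constant(customers, "income")
by_mean = fill_with_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "credit_risk")
```

With these records the overall mean is 10,000, while the class means are
11,000 for low-risk and 8,000 for high-risk customers, so procedure (v) fills
the two gaps with different values.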
Procedures (iii) to (vi) bias the data. The filled-in value may not be
correct. In comparison to the other procedures, procedure (vi) uses the
most information from the present data to predict missing values. There is a
greater chance that the relationships between income and the other attributes
are preserved by considering the values of the other attributes in its estimation
of the missing value for income. In some cases, a missing value may not
imply an error in the data. For example, when applying for a credit card,
candidates may be asked to supply their driver’s license number. This field is
left blank by candidates who do not have a driver’s license. Forms should permit
respondents to specify values like “not applicable”. Software routines may
also be used to uncover other null values like “don’t know”, “?”, or “none”.
Ideally, each attribute should have one or more rules regarding the null condition.
Whether or not nulls are permitted may be specified by the rules. How such
values should be handled or transformed may also be specified by the rules.
When fields are to be filled in at a later step of the business process, they may
also be intentionally left blank. Thus, although we can try our best to clean the
data after it is captured, good design of the database and of data entry procedures
should help minimize the number of missing values or errors in the first place.

Q.26. What is noisy data ? Explain various data smoothing methods.

Ans. Noise refers to a random error or variance in a measured variable.
Given a numerical attribute like price, how can we smooth out the data
to remove the noise ? For smoothing the data, there are various data smoothing
methods, which are given below —

(i) Binning — These methods smooth a sorted data value by
consulting its “neighbourhood”, meaning the values around it. The sorted values
are distributed into a number of bins. Binning methods perform local smoothing
because they consult the neighbourhood of values. Some binning methods are
illustrated in fig. 1.10. In this example, the data for price are sorted, then
partitioned into equal-frequency bins of size three. In smoothing by bin means,
each value in a bin is replaced by the mean value of the bin. For example, the
mean of the values 5, 9 and 10 in Bin 1 is 8. Hence each original value in this
bin is replaced by the value 8. Similarly, smoothing by bin medians can be
used, in which each bin value is replaced by the bin median. In smoothing by
bin boundaries, the minimum and maximum values in a given bin are identified
as the bin boundaries. Then, each bin value is replaced by the closest
boundary value. In general, the larger the width, the greater the effect of the
smoothing. Alternatively, bins can be equal width, where the interval range of
values in each bin is constant.
Sorted Data for Price (in Dollars): 5, 9, 10, 21, 21, 24, 25, 28, 34

Partition into (Equal-frequency) Bins —
Bin 1: 5, 9, 10
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by Bin Means —
Bin 1: 8, 8, 8
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by Bin Boundaries —
Bin 1: 5, 10, 10
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Fig. 1.10
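
The binning methods of fig. 1.10 can be reproduced with a short Python sketch.
The function names are illustrative, and the rule that a value exactly halfway
between the two boundaries goes to the lower boundary is an assumption, since
the text does not say how ties are broken.

```python
def equal_frequency_bins(sorted_values, bin_size):
    """Partition already-sorted values into bins of bin_size values each."""
    return [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties are sent to the lower boundary (an assumption).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [5, 9, 10, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
means = smooth_by_means(bins)
boundaries = smooth_by_boundaries(bins)
```

In Bin 1 the value 9 lies one unit from 10 and four units from 5, so smoothing
by bin boundaries moves it to 10.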


(ii) Regression — Data can be smoothed by fitting the data to a function,
such as with regression. Linear regression involves finding the “best” line to fit
two attributes, so that one attribute can be employed to predict the other. Multiple
linear regression is an extension of linear regression, where more than two
attributes are included and the data are fit to a multidimensional surface.

(iii) Clustering — Outliers can be detected by clustering, where similar
values are organized into groups or clusters. Intuitively, values which fall outside
of the set of clusters can be considered outliers, as shown in fig. 1.11.

Fig. 1.11
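
Smoothing by linear regression, method (ii) above, can be sketched with an
ordinary least-squares fit of one attribute against another. The attribute
values below are invented for the illustration, and the closed-form
least-squares formulas are the standard ones for a straight line.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; no line can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each noisy y by the value predicted from x on the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

units_sold = [1, 2, 3, 4]
noisy_revenue = [2.1, 3.9, 6.2, 7.8]  # roughly 2 * units_sold, plus noise

slope, intercept = fit_line(units_sold, noisy_revenue)
smoothed_revenue = smooth_by_regression(units_sold, noisy_revenue)
```

The fitted line has a slope close to 2, and the smoothed values lie exactly on
that line, which removes the small random deviations in the original readings.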


Q.27. Write about data cleaning as a process ? Are there any tools out
there to help ?

Ans. Discrepancy detection is the first step in data cleaning as a process.
Discrepancies can be caused by several factors, including poorly designed
data entry forms which have many optional fields, human error in data entry,
deliberate errors (e.g., respondents not wanting to divulge information about
themselves), and data decay (e.g., outdated addresses). Discrepancies may
also arise from inconsistent data representations and the inconsistent use of
codes. Errors in instrumentation devices which record data, and system errors,
are another source of discrepancies. If the data are employed for purposes
other than originally intended, errors can also occur. There may also be
inconsistencies because of data integration. Another source of errors is field
overloading, which generally results when developers squeeze new attribute
definitions into unused (bit) portions of already defined attributes. The data
should also be examined regarding unique rules, consecutive rules, and null rules.

There are various commercial tools for discrepancy detection.
Data scrubbing tools detect errors and make corrections in the data by
using simple domain knowledge. These tools depend on parsing and fuzzy
matching methods when cleaning data from various sources. Data auditing
tools find discrepancies by analyzing the data to discover rules and relationships,
and detecting data which violate such conditions. They are variants of
data mining tools. Some data inconsistencies may be corrected manually using
external references. Data transformation is the second step in data cleaning
as a process. That is, once discrepancies are found, transformations need to
be defined and applied to correct them. Commercial tools can assist
in the data transformation step.

Data cleaning generally involves the following phases —

(i) Data Analysis — To detect which kinds of errors and
inconsistencies are to be removed, a detailed data analysis is required. In addition
to a manual inspection of the data or data samples, analysis programs should be
used to gain metadata about the data properties and detect data quality problems.

(ii) Definition of Transformation Workflow and Mapping Rules —
Depending on the number of data sources, their degree of heterogeneity and
the “dirtyness” of the data, a large number of data transformation and cleaning
steps may have to be executed. Sometimes, a schema translation is used to
map sources to a common data model; for data warehouses, typically a relational
representation is used. Early data cleaning steps can correct single-source
problems and prepare the data for integration.

(iii) Verification — The correctness and effectiveness of a
transformation workflow and the transformation definitions should be tested
and evaluated, e.g., on a sample or copy of the source data, to improve the
definitions if necessary. Multiple iterations of the analysis, design and verification
steps may be needed, e.g., since some errors only become apparent after applying
some transformations.

(iv) Transformation — Execution of the transformation steps either
by running the ETL workflow for loading and refreshing a data warehouse
or during answering queries on multiple sources.

(v) Backflow of Cleaned Data — After (single-source) errors are
removed, the cleaned data should also replace the dirty data in the original
sources in order to give legacy applications the improved data too and to avoid
redoing the cleaning work for future data extractions. For data warehousing,
the cleaned data is available from the data staging area.


Q.29. Describe the issues to be considered during data integration.
(R.G.P.V., June 2010, 2014)
Ans. Following issues are to be considered during data integration —

(i) Schema Integration and Object Matching — The entity
identification problem is the matching of equivalent real-world entities from
multiple data sources. For example, how can the user or the computer determine
that prod_name in one data store and Pname in another data store refer to the
same attribute ? The name, meaning, data type, and range of values permitted
for the attribute, and null rules for handling blank, zero, or null values are
examples of metadata for each attribute. Such metadata can be used to help
avoid errors in schema integration.

(ii) Redundancy — An attribute that can be derived from another
attribute or set of attributes may be redundant. Redundancies in the data store
may also occur due to inconsistency in attribute or dimension naming. Some
redundancies can be detected by correlation analysis. In addition to detecting
redundancies between attributes, duplication should also be detected at the
tuple level. Another source of data redundancy is the use of denormalized tables.
Because of inaccurate data entry, or updating some instead of all occurrences
of the data, inconsistencies often arise between various duplicates. Consider,
for example, a sales database that contains the attributes supplier name and
location instead of a key to this information in a supplier database. Discrepancies
can then take place, such as the same supplier’s name appearing with different
locations.


(iii) Detection and Resolution of Data Value Conflicts — The detection
and resolution of data value conflicts is another important issue in data
integration. For instance, attribute values from different sources may differ for
the same real-world entity because of differences in representation, encoding,
or scaling. For example, a weight attribute may be stored in metric units in one
system and British imperial units in another. An attribute in one system may be
recorded at a lower level of abstraction than the same attribute in another.
Special attention must be paid to the structure of the data when attributes from
one database are matched to those of another database during integration. This
is to make sure that any attribute functional dependencies and referential
constraints in the source system match those in the target system.
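
The correlation analysis mentioned under redundancy can be sketched with the
Pearson correlation coefficient, which is one common choice for numeric
attributes. The attribute names, the sample values and the 0.95 threshold
below are all assumptions made for the illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)

def looks_redundant(xs, ys, threshold=0.95):
    """Flag a pair of attributes whose correlation is very close to +1 or -1."""
    return abs(pearson(xs, ys)) >= threshold

weight_kg = [1.0, 2.5, 4.0, 10.0]
weight_lb = [w * 2.2046 for w in weight_kg]   # same fact, imperial units
units_sold = [7, 3, 9, 4]

r_weights = pearson(weight_kg, weight_lb)
r_unrelated = pearson(weight_kg, units_sold)
```

The two weight attributes carry the same information in different units, so
their correlation is essentially 1 and one of them can be dropped after
integration. A high correlation only suggests redundancy; it does not prove
that one attribute is derived from the other.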
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data. These techniques include regression, clustering pad binning. 
(iii) Normalization — This operation is applied when the attribute
data are scaled so as to fall within a small specified range, such as 1 to 10.

(iv) Aggregation — This operation is used where summary or
aggregation operations are performed on the data. For example, the daily sales
data may be aggregated so as to determine weekly, monthly, quarterly and
annual total amounts.

(v) Generalization — This operation is performed when low-level
data are replaced by higher-level concepts using concept
hierarchies. For example, attributes, like street, can be generalized to city or
country.

Q.31. What is meant by data transformation ? What are the various
ways of data transformation ? (R.G.P.V., June 2010)
Or
Discuss in detail about data transformation. (R.G.P.V., June 2011)
Or
What is meant by data transformation ? (R.G.P.V., June 2014)
Ans. Refer to Q.23 and Q.30.


Q.32. Describe various forms of data normalization. Also specify their
value ranges. (R.G.P.V., Dec. 2010)

Ans. An attribute is normalized by scaling its values so that they fall
within a small specified range, like 1 to 10. Normalization is
particularly useful for classification algorithms involving neural networks or
distance measurements like nearest-neighbour classification and clustering. If
the neural network backpropagation algorithm is used for classification mining,
normalizing the input values for each attribute measured in the training tuples
will help speed up the learning phase. For distance-based methods, normalization
helps prevent attributes with initially large ranges from outweighing attributes
with initially smaller ranges. Some methods for data normalization are as
follows —

(i) Decimal Scaling Normalization — This normalization moves
the decimal point of the values of attribute A. The number of decimal places
moved depends on the maximum absolute value of A. The value, v, of A is
normalized to v' by calculating

$$v' = \frac{v}{10^{j}}$$

where j denotes the smallest integer such that max(|v'|) < 1.

(ii) Min-max Normalization — Min-max normalization performs a
linear transformation on the original data. Suppose that the minimum and
maximum values of an attribute, A, are $min_A$ and $max_A$. Min-max normalization
maps a value, v, of A to v' in the range [$new\_min_A$, $new\_max_A$] by calculating

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

Min-max normalization preserves the relationships among the original data
values. If a future input case for normalization falls outside of the original
data range for A, min-max normalization will encounter an out-of-bounds error.

(iii) Z-score Normalization — It is also called zero-mean
normalization. In this normalization the values for an attribute, A, are
normalized on the basis of the mean and standard deviation of A. Suppose that
the mean and standard deviation of an attribute A are $\bar{A}$ and $\sigma_A$.
A value, v, of A is normalized to v' by calculating

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

Z-score normalization is useful when there are outliers that dominate the
min-max normalization or when the actual minimum and maximum of attribute
A are unknown.
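
The three normalization methods can be written directly from their formulas.
The sketch below is a minimal illustration: the function names and sample
values are invented, and the population standard deviation is assumed for the
z-score because the text does not say which estimator is intended.

```python
import statistics

def decimal_scaling(values):
    """v' = v / 10**j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Linear mapping of [min_A, max_A] onto [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max normalization needs at least two distinct values")
    return [(v - low) / (high - low) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """v' = (v - mean) / standard deviation (population form assumed)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mean) / std for v in values]

scaled = decimal_scaling([-986, 917])          # j = 3
incomes = min_max([12000, 73600, 98000])       # mapped onto [0.0, 1.0]
scores = z_score([2, 4, 4, 4, 5, 5, 7, 9])     # mean 5, standard deviation 2
```

On the value ranges: decimal scaling always yields values strictly between -1
and 1, min-max normalization yields values in the chosen range
[new_min, new_max], and z-score values are not confined to a fixed range,
although most of them usually fall within a few units of zero.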


i 0.33. What do you mean by data reduction ? What are the strategies of 
the ‘data reduction ? — (R.GP.V., June 2015) 
7s kh Pion Or $ Be 
Write short note on data reduction. 

; Or 
‘Define data reduction. Explain different technique for data reduction. 

7 i (R.GPV, May 2018) 
a n Data reduction techniques are performed to get a reduce data set 
at is much smaller in volume, yet produces the same analytical results. That 


(R.GPV., Dec. 2004) 


— 
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Jata reduction 


Following strategies are used for data reduction -

(i) Data Cube Aggregation - In this strategy aggregation operations
are performed to the data in the construction of a data cube.

(ii) Attribute Subset Selection - In this strategy weakly relevant,
irrelevant, or redundant attributes or dimensions may be detected and
eliminated.

(iii) Dimensionality Reduction - In this strategy encoding mechanisms
are used to decrease the size of data set.

(iv) Numerosity Reduction - In this strategy the data are replaced
or estimated by alternative, smaller data representations like parametric
models or nonparametric methods, which include histograms, clustering, and
sampling.

(v) Data Discretization and Generation of Concept Hierarchy - Here,
raw data values for attributes are replaced by ranges or higher conceptual
levels. They allow mining of data at multiple levels of abstraction.

Data discretization is a form of numerosity reduction that is very useful for
the automatic generation of concept hierarchies. Data discretization and
generation of concept hierarchy are powerful tools for data mining.


Q.34. Discuss data cube aggregation strategy.

Ans. Assume that we have data consisting of sales per month, for the years
2006 to 2009. But, we would like the annual sales instead of sales per month.
Therefore, the data can be aggregated so that the resulting data summarize the
total sales per year instead of per month, as shown in fig. 1.12. The resulting
data set is smaller in volume, without loss of information content necessary
for the analysis task.

Year 2006 (monthly sales)

| Month     | Sales   |
| January   | $25,000 |
| February  | $35,000 |
| March     | $40,000 |
| April     | $50,000 |
| May       | $15,000 |
| June      | $65,000 |
| July      | $85,000 |
| August    | $75,000 |
| September | $70,000 |
| October   | $73,000 |
| November  | $83,000 |
| December  | $95,000 |

Annual sales

| Year | Sales    |
| 2006 | $711,000 |
| 2007 | $854,000 |
| 2008 | $665,000 |
| 2009 | $554,000 |

Fig. 1.12 Sales Data are Aggregated from Month to the Annual Sales
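
The aggregation in fig. 1.12 can be illustrated with a short Python sketch. The twelve amounts are the monthly figures visible in the figure; assigning them to January through December in order, and the function name `aggregate_by_year`, are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_by_year(monthly_rows):
    """Sum (year, month, amount) rows into one total per year."""
    totals = defaultdict(int)
    for year, _month, amount in monthly_rows:
        totals[year] += amount
    return dict(totals)


# Monthly sales for 2006 in dollars, assumed to run January..December.
sales_2006 = [25000, 35000, 40000, 50000, 15000, 65000,
              85000, 75000, 70000, 73000, 83000, 95000]
rows = [(2006, month, amount) for month, amount in enumerate(sales_2006, start=1)]

annual = aggregate_by_year(rows)
print(annual)  # {2006: 711000}
```

Twelve monthly rows collapse into one annual row, and the total still agrees with the $711,000 shown for the first year in the figure.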


Multidimensional aggregated data are stored in data cubes. For instance,
a data cube for multidimensional analysis of sales data with respect to annual
sales per item type for each branch is depicted in fig. 1.13. Here, each cell
stores an aggregate data value, corresponding to the data point in
multidimensional space. For each attribute, there exists a concept hierarchy
that allows the analysis of data at multiple levels of abstraction. For example,
a hierarchy for location could allow branches to be grouped into states, on the
basis of their cities. Data cubes offer fast access to precomputed, summarized
data. The cube formed at the lowest level of abstraction is known as the base
cuboid, which corresponds to an individual entity of interest like sales or
customer. The cube formed at the highest level of abstraction is known as the
apex cuboid. The lowest level should be usable or useful for the data analysis.
On the other side, the apex cuboid would give one total - the total sales for
all the years, for all item types, and for all locations. Data cubes generated
for varying levels of abstraction are known as cuboids, therefore a data cube
may refer to a lattice of cuboids. Each higher level of abstraction further
reduces the size of resulting data set. The smallest available cuboid relevant
to the given task should be used when replying to data mining requests.


Fig. 1.13 A Data Cube for Sales (year axis: 2006, 2007, 2008, 2009)
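
The lattice of cuboids can be made concrete with a small sketch. It is a naive illustration rather than how a warehouse engine materializes cubes: the rows, the dimension names and the function `build_cuboids` are hypothetical, and every subset of the dimensions is simply aggregated by a full scan.

```python
from collections import defaultdict
from itertools import combinations


def build_cuboids(rows, dimensions, measure):
    """Aggregate the measure for every subset of the dimensions.

    The subset with all dimensions is the base cuboid; the empty
    subset is the apex cuboid holding a single grand total.
    """
    cuboids = {}
    for size in range(len(dimensions), -1, -1):
        for dims in combinations(dimensions, size):
            cells = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key] += row[measure]
            cuboids[dims] = dict(cells)
    return cuboids


# Hypothetical sales facts.
rows = [
    {"year": 2006, "item_type": "tv", "branch": "A", "sales": 100},
    {"year": 2006, "item_type": "phone", "branch": "A", "sales": 50},
    {"year": 2007, "item_type": "tv", "branch": "B", "sales": 70},
    {"year": 2007, "item_type": "tv", "branch": "A", "sales": 30},
]
cuboids = build_cuboids(rows, ("year", "item_type", "branch"), "sales")

print(cuboids[()])         # apex cuboid: {(): 250}
print(cuboids[("year",)])  # {(2006,): 150, (2007,): 100}
```

With three dimensions there are 2^3 = 8 cuboids; a request about yearly totals only needs the small `("year",)` cuboid, not the base cuboid.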


Q.35. Why do we need a subset of attributes ? What is its goal and
how can we find a good subset ?


Ans. For data analysis, the data set may contain a large number of attributes,
some of which are redundant and not required for the mining task. For example,
if the task is to classify customers as to whether or not they are likely to
purchase a popular new game at a shop, attributes such as age or interest are
relevant. However, attributes like phone number are likely to be irrelevant.
It can be possible for a domain expert to select some of the useful attributes,
but this becomes a hard and time-consuming task, especially in situations when
the behaviour of the data is unknown. Keeping irrelevant attributes or leaving
out relevant ones may be harmful, causing confusion for the mining algorithm
used. As a result, patterns of poor quality are discovered. Besides, the
addition of redundant or irrelevant attributes can reduce the efficiency of the
mining process. Therefore, we need a subset of attributes that reduces the size
of the data set by removing redundant or irrelevant attributes.


The goal of attribute subset selection is to select a set of attributes such
that the resulting probability distribution of the data classes is as similar
as possible to the original distribution obtained using all attributes. The
number of attributes appearing in the discovered patterns is decreased by
mining on a reduced set of attributes, helping to make the patterns intuitive.

There are in total 2^n possible subsets for n attributes. To find an optimal
subset of attributes, heuristic methods are commonly used. Typically, these
methods are greedy in that, while searching through attribute space, they
always make what looks to be the best choice at the time. Their strategy is to
make a locally optimal choice in the hope that this will result in a globally
optimal solution. Pragmatically, these methods are effective in use and may
come close to estimating an optimal solution. Tests of statistical significance
are used to determine the best and worst attributes. Such tests consider that
the attributes are independent of each other.


Q.36. Explain in detail about data cube aggregation and attribute subset
selection. (R.G.P.V., June 201


Ans. Refer to Q.34 and Q.35.


Q.37. What are the basic heuristic methods of attribute subset selection ?


Ans. The basic heuristic methods of attribute subset selection are as
follows -

(i) Forward Selection - In this method, initially the empty set of
attributes is taken as the reduced set. At each subsequent iteration or step,
the best of the original attributes is added to the reduced set.

(ii) Stepwise Backward Elimination - In this method, the full set
of attributes is taken as the initial reduced set. Afterwards, it deletes the
worst attribute present in the set at each step.

(iii) Combination of Forward Selection and Backward Elimination -
In this method, both the above methods are combined so that, at each step, the
procedure chooses the best attribute and deletes the worst from among the
remaining attributes.

(iv) Decision Tree Induction - Decision tree algorithms were originally
intended for classification. Some algorithms are CART, ID3, and C4.5. In this
method, a tree is constructed from the given data. In this tree, each internal
node denotes a test on an attribute, each branch corresponds to an outcome of
the test, and each leaf node denotes a class prediction.
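
A minimal sketch of greedy forward selection follows. The `score` callable stands in for whatever evaluation measure is used (the text mentions tests of statistical significance); the toy usefulness weights, the per-attribute penalty and all names below are hypothetical and exist only to make the example runnable.

```python
def forward_selection(attributes, score, max_size=None):
    """Greedily grow the reduced set, one best attribute per step.

    Stops when no remaining attribute improves the score, or when
    max_size attributes have been chosen.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining and (max_size is None or len(selected) < max_size):
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # locally optimal: nothing left improves the set
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness of each attribute for the mining task.
usefulness = {"age": 3.0, "interest": 2.0, "phone_number": 0.0}


def toy_score(subset):
    # Reward useful attributes, charge a small cost per attribute kept.
    return sum(usefulness[a] for a in subset) - 0.5 * len(subset)


chosen = forward_selection(["phone_number", "interest", "age"], toy_score)
print(chosen)  # ['age', 'interest']
```

Backward elimination would run the same loop in reverse, starting from the full set and dropping the attribute whose removal improves the score most.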


Q.38. What do you understand by dimensionality reduction ? Explain
two methods of dimensionality reduction. (R.G.P.V., June 2010, 2014)


Ans. In dimensionality reduction, to obtain a reduced or "compressed"
representation of the original data, data encoding or transformations are
performed. The data reduction is lossless when the original data can be
reconstructed from the compressed data without loss of information content.
However, the data reduction is lossy if we can reconstruct only an
approximation of the original data. There are a number of elegant algorithms
for string compression. These algorithms are typically lossless, but they
allow only limited manipulation of the data. Two popular and effective methods
of lossy dimensionality reduction are as follows -


(i) Wavelet Transforms - The discrete wavelet transform (DWT)
is a linear signal processing technique. This method transforms a data vector
X to a numerically different vector, X', of wavelet coefficients. The two
vectors are of the same length. Here, we consider each tuple as an
n-dimensional data vector, i.e. X = (x_1, x_2, ..., x_n), when applying this
technique to data reduction. The usefulness of this technique lies in the fact
that the wavelet transformed data can be shortened. By keeping only a small
fraction of the strongest of the wavelet coefficients, a compressed
approximation of the data can be retained. This technique also works to
eliminate noise without smoothing out the main features of the data, and is
thereby effective for data cleaning as well. If a set of coefficients is given,
an approximation of the original data can be found by applying the inverse of
the DWT used.

The DWT is closely related to the discrete Fourier transform (DFT). The
DFT is a signal processing technique involving sines and cosines. The DWT
obtains better lossy compression. It means that the DWT offers a more accurate
approximation of the original data when the same number of coefficients is
retained for a DWT and a DFT. Therefore, the DWT needs less space as
compared to the DFT for an equivalent approximation. In the DWT, wavelets are
quite localized in space, contributing to the conservation of local detail.
Popular wavelet transforms are the Haar-2, Daubechies-4 and Daubechies-6
transforms. Wavelet transforms can be applied to multidimensional data, such as
a data cube. This method provides better results on sparse or skewed data and
on data with ordered attributes. Wavelet transforms have several real-world
applications like data cleaning, computer vision, analysis of time series data
and compression of fingerprint images. Wavelet transforms are suitable for
high dimensionality data.


(ii) Principal Component Analysis (PCA) - This method is also
known as the Karhunen-Loeve method. Consider that the data to be reduced
comprise tuples or data vectors described by n attributes. This method searches
for k n-dimensional orthogonal vectors that are used to express the data,
where k ≤ n. Thus the original data are projected onto a much smaller space,
causing dimensionality reduction. By creating an alternative, smaller set of
variables, PCA "combines" the essence of attributes. Afterwards, the initial
data can be projected onto this smaller set. PCA often shows relationships that
were not previously suspected and hence permits interpretations that would
not ordinarily result.

PCA can be applied to ordered and unordered attributes and can handle
sparse and skewed data. Principal component analysis is computationally
inexpensive. Multidimensional data of more than two dimensions can be
handled by transforming the data into two dimensions. PCA tends to be better at
handling sparse data.
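
The projection idea behind PCA can be sketched without any library, under loud assumptions: the code below finds only the first (strongest) principal component, uses power iteration on the covariance matrix as a simple stand-in for a full eigen-decomposition, centers the data but does not rescale it, and uses made-up points. The function names are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (means, unit vector) for the direction of greatest variance.

    Caveat: power iteration can stall if the start vector happens to be
    orthogonal to the dominant direction; a real implementation would
    use a proper eigen-solver.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return means, vec


def project(rows, means, vec):
    """Reduce each tuple to one coordinate along the component."""
    return [sum((r[j] - means[j]) * vec[j] for j in range(len(vec))) for r in rows]


# Hypothetical 2-D tuples lying on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
means, component = first_principal_component(points)
reduced = project(points, means, component)
```

Here two attributes are replaced by a single coordinate per tuple, and because the points lie on one line nothing is lost; on real data the discarded components carry the residual variance.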
Q.39. Explain in detail about dimensionality reduction technique.
(R.G.P.V., June 2017)
Ans. Refer to Q.38.


Q.40. Write the general procedure for the following -
(i) Wavelet transforms
(ii) Principal components analysis.

Ans. (i) Wavelet Transforms - A hierarchical pyramid algorithm is used
by the general procedure for applying a discrete wavelet transform. This
algorithm halves the data at each iteration for fast computational speed. The
general procedure is as follows -

(a) The length (L) of the input data vector must be an integer power
of 2. This condition can be satisfied by padding the data vector with zeros as
required (L ≥ n).

(b) Two functions are performed by each transform. The first
function carries out some data smoothing, like a sum or weighted average.
The second function carries out a weighted difference which acts to bring out
the detailed features of the data.

(c) The two functions are applied to pairs of data points in
X, that is, to all pairs of measurements (x_{2i}, x_{2i+1}). This leads to two
data sets of length L/2. Generally, these data sets show a smoothed or
low-frequency version of the input data and the high-frequency content of it,
respectively.

(d) Until the resulting data sets obtained are of length 2, the two
functions are recursively applied to the data sets obtained in the previous
loop.

(e) The wavelet coefficients of the transformed data are designated
by the selected values from the data sets obtained in the above iterations.

(ii) Principal Component Analysis - The general procedure is as
follows -

(a) The normalization of input data is performed, and as a result
each attribute falls within the same range. This step helps to make sure that
attributes with large domains will not dominate attributes with smaller domains.

(b) A basis for the normalized input data is provided by the
orthogonal vectors computed by PCA. These are unit vectors, each pointing in
a direction perpendicular to the others. These vectors are called the principal
components. The input data are a linear combination of the principal
components.

(c) The principal components are sorted in order of decreasing
strength or significance. The principal components serve as a new set of axes
for the data, giving information about variance. That is, the sorted axes are
such that the first axis represents the most variance among the data, the second
axis represents the next highest variance, and so on. This information helps to
identify groups or patterns within the data.

(d) The data size can be reduced by eliminating the weaker
components, since the components are sorted according to decreasing order
of strength. It should be possible to reconstruct a good approximation of the
original data using the strongest principal components.
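
The pyramid procedure in part (i) above can be sketched for the simplest case. The assumptions are stated plainly: the smoothing function is taken to be the pairwise average and the difference function half the pairwise difference (an unnormalized Haar-style transform), the loop stops once the smoothed data set has length 2 as in step (d), and the function name and sample vector are illustrative only.

```python
def haar_dwt(values, stop_length=2):
    """Pyramid algorithm with pairwise averages and half-differences."""
    data = [float(v) for v in values]

    # Step (a): pad with zeros up to the next integer power of 2.
    length = 1
    while length < len(data):
        length *= 2
    data += [0.0] * (length - len(data))

    details = []   # detail coefficients, coarsest level first
    smooth = data
    # Steps (b)-(d): halve the smoothed data until it reaches stop_length.
    while len(smooth) > stop_length:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        detail = [(a - b) / 2 for a, b in pairs]   # weighted difference
        smooth = [(a + b) / 2 for a, b in pairs]   # smoothing (average)
        details.insert(0, detail)

    # Step (e): the coefficients are the final smooth values followed
    # by the detail values of each level.
    coefficients = list(smooth)
    for detail in details:
        coefficients.extend(detail)
    return coefficients


coefficients = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coefficients)  # [1.5, 4.0, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Several coefficients are already zero; setting the smallest remaining ones to zero as well and storing only the rest gives the compressed approximation described for wavelet-based reduction.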


Q.41. What do you understand by numerosity reduction ?


Ans. Numerosity reduction techniques are used to reduce the size of the data
set by choosing alternative, smaller forms of data representation. These
techniques are of two types, parametric and nonparametric. For parametric
methods, a model is used to estimate the data, so that typically only the model
parameters need to be stored, instead of the actual data. For example,
log-linear models estimate discrete multidimensional probability distributions.
Nonparametric methods for storing reduced representations of the data include
sampling, clustering and histograms.


Q.42. What is histogram ? How are the buckets determined and the
attribute values partitioned ?


Ans. Histograms are popular for data reduction. Histograms use binning
to estimate data distributions. A histogram for an attribute (A) partitions the
data distribution of the attribute into disjoint subsets. These disjoint subsets
are referred to as buckets. The buckets are known as singleton buckets, if each
bucket represents only a single attribute-value/frequency pair.

There are several attribute value partitioning rules, some of them are as
follows -


(i) Equal-frequency - The frequency of each bucket is constant in
an equal-frequency histogram. This is also known as equidepth or equal-depth.

(ii) Equal-width - The width of each bucket range is uniform in an
equal-width histogram.

(iii) MaxDiff - Here, we consider the difference between each pair
of adjacent values. A bucket boundary is established between each pair for
pairs containing the β - 1 largest differences, where β represents the number
of buckets specified by the user.

(iv) V-optimal - The V-optimal histogram is the one with the least
variance among all of the possible histograms for a given number of buckets.
Histogram variance is a weighted sum of the original values that each bucket
represents, where bucket weight represents the total number of values
in the bucket.
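
The first two partitioning rules above can be compared on a small skewed sample. This is a sketch only: the function names and the sample values are hypothetical, and the equal-frequency version simply slices the sorted values into nearly equal-sized runs.

```python
def equal_width_buckets(values, num_buckets):
    """Split the value range into num_buckets ranges of uniform width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [sorted(values)] + [[] for _ in range(num_buckets - 1)]
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in sorted(values):
        # The maximum value falls on the upper edge, so clamp the index.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index].append(v)
    return buckets


def equal_frequency_buckets(values, num_buckets):
    """Give every bucket (roughly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // num_buckets:(i + 1) * n // num_buckets]
            for i in range(num_buckets)]


# Hypothetical attribute values with one outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
by_width = equal_width_buckets(values, 3)
by_frequency = equal_frequency_buckets(values, 3)

print(by_width)      # [[1, 2, 3, 4, 5, 6, 7, 8], [], [100]]
print(by_frequency)  # [[1, 2, 3], [4, 5, 6], [7, 8, 100]]
```

The outlier leaves the equal-width histogram with an empty middle bucket, while the equal-frequency histogram keeps three values per bucket; only the bucket ranges and counts need to be stored in place of the raw values.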


Q.43. Write short note on clustering techniques.


Ans. Data tuples are considered as objects in clustering techniques. They
categorize the objects into groups or clusters, with the aim that objects within
a cluster are similar to one another and differ from objects in other clusters.
On the basis of a distance function, similarity is defined as how "close" the
objects are in space. The "quality" of a cluster may be determined by its
diameter, the maximum distance between any two objects in the cluster. Centroid
distance is another measure of cluster quality. The centroid distance is the
average distance of each cluster object from the cluster centroid.

In data reduction, the cluster representations of the data are used to replace
the actual data. This technique is more effective for data that can be organized
into distinct clusters. The effectiveness of this technique relies on the nature
of the data.

Multidimensional index trees are mainly used for providing fast data access
in database systems. These trees can also be used for hierarchical data
reduction, providing a multiresolution clustering of the data. This can be used
to give approximate answers to queries. For a given set of data objects, an
index tree recursively divides the multidimensional space, where the root node
represents the whole space. Typically, these trees are balanced and contain
internal and leaf nodes. Each parent node has keys and pointers to child nodes
that collectively show the space represented by the parent node. Each leaf
node has pointers to the data tuples it represents. Hence, an index tree can
store aggregate and detail data at varying levels of resolution or abstraction.
A hierarchy of clusterings of the data set is provided by an index tree, where
each cluster contains a label that holds for the data contained in the cluster.


Q.44. Why sampling is used as a data reduction technique ? What is
the advantage of sampling for data reduction ?


Ans. Sampling is used as a data reduction technique since it represents a
large data set by a much smaller random sample of the data. Consider that D
represents a large data set that contains N tuples and s is the size of the
sample. The size of the sample (s) is less than N. Some of the common
methods by which we could sample D for data reduction are as follows.

(i) Simple Random Sample Without Replacement (SRSWOR) -
This is created by drawing s of the N tuples from D, where the probability of
drawing any tuple is 1/N. It means that all tuples have equal probability to be
sampled.

(ii) Simple Random Sample With Replacement (SRSWR) - This
method is the same as the SRSWOR, except that each time a tuple is drawn from
D, it is recorded and then replaced. It means that after a tuple is drawn from
D, it is put back in D so that it may be drawn again.

(iii) Cluster Sample - A simple random sample of s clusters can be
obtained if tuples in a large data set (D) are grouped into M mutually disjoint
clusters, where s < M. For instance, tuples are retrieved a page at a time, so
that each page is used as a cluster. By applying SRSWOR to the pages, a reduced
data representation can be achieved, providing a cluster sample of the tuples.
We can also explore other clustering criteria that convey rich semantics.

(iv) Stratified Sample - A stratified sample of a large data set (D) is
created by obtaining an SRS at each stratum, if D is partitioned into mutually
disjoint parts known as strata. This helps ensure a representative sample,
particularly when skewed data are used. For instance, a stratified sample may
be obtained from customer data, where a stratum is generated for each
customer age group. In this way, the age group that contains the least number
of customers will be sure to be represented.

Advantage - A benefit of sampling is that the cost of obtaining a sample
is proportional to the size of the sample, s. Therefore, sampling complexity is
potentially sublinear to the size of the data. For a fixed sample size, sampling
complexity increases only linearly when the number of data dimensions, n,
increases. Sampling is commonly employed to approximate the answer to an
aggregate query when applied to data reduction. Sampling is a natural choice
for progressive refinement of a reduced data set. Such a data set can be further
refined by simply increasing the sample size.
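
Three of the sampling schemes above can be sketched with Python's standard `random` module. The customer tuples, the 10% sampling fraction, the fixed seed and the function names are all hypothetical; the stratified version assumes proportional allocation with at least one tuple per stratum, which is one possible choice and not something the text prescribes.

```python
import random


def srswor(data, s, rng):
    """Simple random sample without replacement: s distinct tuples."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may repeat."""
    return [rng.choice(data) for _ in range(s)]


def stratified_sample(data, stratum_of, fraction, rng):
    """Draw an SRSWOR inside every stratum, at least one tuple each."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed customer data: 90 young customers, 10 seniors.
customers = [(i, "young" if i < 90 else "senior") for i in range(100)]

without_replacement = srswor(customers, 10, rng)
with_replacement = srswr(customers, 10, rng)
stratified = stratified_sample(customers, lambda c: c[1], 0.1, rng)
```

A plain random sample of 10 tuples could miss the seniors entirely, whereas the stratified sample always contains at least one tuple from each age group.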


Q.45. Describe discretization and concept hierarchy. (R.G.P.V., Dec. 2009)
Or
Explain in detail about data discretization and concept hierarchy
generation. (R.G.P.V., June 2017)

Ans. For a given continuous attribute, data discretization techniques are
used to reduce the number of values by splitting the range of the attribute
into intervals. Then, these interval labels can be used to replace actual data
values. The original data can be simplified and shortened by replacing the
numerous values of a continuous attribute with a small number of interval
labels. This results in a concise, easy-to-use, knowledge-level representation
of mining results.

Discretization techniques can be classified on the basis of how the
discretization is performed. If the discretization process uses class
information, it is called supervised discretization, else it is called
unsupervised discretization. Top-down discretization or splitting refers to a
process that starts by first finding one or a few points (split-points) to
split the entire attribute range, and then repeats this recursively on the
resulting intervals. Bottom-up discretization or merging refers to a process
that starts by considering all of the continuous values as potential
split-points, eliminates some by merging neighbouring values to form intervals,
and then repeats this process recursively on the resulting intervals. On an
attribute, discretization can be performed recursively to offer a hierarchical
or multiresolution partitioning of the attribute values called a concept
hierarchy. For a given numerical attribute, a concept hierarchy defines a
discretization of the attribute.

Concept hierarchies are used to reduce the data set size by replacing low-level
concepts with high-level concepts. The generalized data may be more
meaningful and easier to interpret, although such data generalization may lose
some details. This supports a consistent representation of data mining
results among multiple mining tasks, which is a common requirement. Apart
from this, mining on a reduced data set needs fewer input/output operations
and is more efficient. Due to these benefits, discretization techniques and
concept hierarchies are typically applied before data mining as a preprocessing
step.

Fig. 1.14 shows a concept hierarchy for the attribute price, in dollars.
Here, to satisfy the requirements of several users, more than one concept
hierarchy can be defined for the same attribute price. For a user or an
expert, manual definition of concept hierarchies can be a tedious and
time-consuming task. Therefore, various discretization methods are used to
automatically generate or dynamically refine concept hierarchies for numerical
attributes.

Fig. 1.14 A Concept Hierarchy (attribute price; surviving interval label: ($1500...$2000])


Q.46. Briefly discuss the following -
(i) Reasons for data partitioning
(ii) Discretization
(iii) Ice-berg query. (R.G.P.V., Dec. 2011)


Ans. (i) Reasons for Data Partitioning - The reasons for data partitioning
are described under the following heads -

(a) Partitioning for Ease of Management - By partitioning the
fact table, we make it easier to manage the data. Because the fact table in a
data warehouse can grow from gigabytes to many hundreds of gigabytes or even
beyond a terabyte in size, there is always a need to partition.

(b) Partitioning to Assist Recovery - The partitioning approach is
also essential for backup and recovery. Restoring or backing up a table having
all the fact data would be a major undertaking.

(c) Partitioning for Performance - The elimination of large
parts of the fact table from the possible set of data that needs to be scanned
can improve the query performance. This is obtained by breaking up the fact
table into partitions, and then requiring the query to scan only the
partitions that are relevant.

(ii) Discretization - Refer to Q.45.


(iii) Ice-berg Query - The term ice-berg query was defined to
characterize a class of OLAP queries that retrieve aggregate values above
some specified threshold t (defined by a HAVING clause). An ice-berg query
in SQL is written as -

SELECT prod, city, SUM(qty)
FROM sales
GROUP BY prod, city
HAVING SUM(qty) >= 500;


In the above query, from all groups of products and store locations (cities),
the user wants only those having an aggregate result equal to or greater than
t = 500. The motivation is that the data analyst is often interested in
exceptional aggregate values that may be helpful for decision support. A typical
query optimizer would first perform the aggregation for each (prod, city) group
and then return the ones whose aggregate value meets the threshold.
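
The ice-berg query above can be tried out with Python's built-in `sqlite3` module. The table contents below are invented for illustration, and an ORDER BY clause is added only so that the printed result is deterministic; it is not part of the query in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (prod TEXT, city TEXT, qty INTEGER)")

# Hypothetical sales rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("tv", "Bhopal", 300),
        ("tv", "Bhopal", 250),
        ("tv", "Indore", 100),
        ("phone", "Indore", 500),
        ("phone", "Bhopal", 499),
    ],
)

# Aggregate every (prod, city) group, then keep only the groups whose
# total reaches the threshold t = 500.
iceberg = conn.execute(
    "SELECT prod, city, SUM(qty) FROM sales "
    "GROUP BY prod, city "
    "HAVING SUM(qty) >= 500 "
    "ORDER BY prod, city"
).fetchall()
conn.close()

print(iceberg)  # [('phone', 'Indore', 500), ('tv', 'Bhopal', 550)]
```

Only two of the four groups rise above the threshold, which is the "tip of the ice-berg" the name refers to.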


Q.47. Describe various concept hierarchy generation strategies for
numerical data.


Ans. For numerical attributes, concept hierarchies are generated
automatically on the basis of data discretization. Here, each method assumes
that the values to be discretized are kept in increasing order.

(i) Binning - Binning is a data smoothing technique. It is a top-down
splitting technique that is based on a given number of bins. Binning is an
unsupervised discretization technique because it does not use class information.
Binning methods are used for data smoothing. These methods are also used as
discretization methods for concept hierarchy generation and numerosity
reduction. For instance, attribute values can be discretized by applying
equal-width or equal-depth binning, and then replacing each bin value by the bin
mean or median, as in smoothing by bin means or smoothing by bin
medians. To generate concept hierarchies, repeat these techniques
recursively on the resulting partitions.

(ii) Histogram Analysis - Histogram analysis does not use class
information, therefore it is an unsupervised discretization technique. For an
attribute (A), histograms partition the values into disjoint ranges. These
ranges are called buckets. In an equal-width histogram, the values are divided
into equal-sized partitions or ranges. In an equal-frequency histogram, each
partition of the values has the same number of data tuples. To produce a
multilevel concept hierarchy, repeat the histogram analysis algorithm
recursively on each partition. The procedure ends after a specified number of
levels is reached. Histograms can also be divided on the basis of data
distribution using cluster analysis.
(iii) Entropy-based Discretization - Entropy was first introduced by
Claude Shannon in pioneering work on information theory and
the concept of information gain. It is one of the most common discretization
measures. Entropy-based discretization is a top-down splitting technique. It is
a supervised discretization technique because it uses class information. It
uses class distribution information in its calculation and determination of
split-points. To discretize a numerical attribute (A), the method chooses the
value of A that gives the minimum entropy as a split-point, and recursively
divides the resulting intervals to arrive at a hierarchical discretization. Such
discretization creates a concept hierarchy for an attribute (A).

(iv) Interval Merging by χ² Analysis - ChiMerge is a χ²-based
discretization method. This method is supervised because it uses class
information. This method uses a bottom-up approach by determining the best
neighbouring intervals and then merging these intervals recursively to form
larger intervals. The relative class frequencies should be fairly consistent
within an interval for accurate discretization. Hence, when two adjacent
intervals have a very similar distribution of classes, then these intervals are
merged. Otherwise, they should remain separate.

Initially, each distinct value of a numerical attribute (A) is considered
to be one interval. χ² tests are carried out for every pair of adjacent
intervals. Low χ² values for a pair of adjacent intervals indicate similar
class distributions, so the adjacent intervals with the least χ² values are
merged together. The process of merging continues recursively until a
prespecified stopping criterion is encountered.
(v) Cluster Analysis - This is a popular data discretization method.
A clustering algorithm can be used to discretize a numerical attribute (A) by
dividing the values of A into clusters or groups. Clustering takes the
distribution of A into consideration, as well as the closeness of data points.
Hence, it is able to generate high-quality discretization results. A concept
hierarchy for an attribute can be generated by using clustering. Here, the
concept hierarchy follows either a bottom-up merging strategy or a top-down
splitting strategy, where each cluster forms a node of the concept hierarchy.
In the bottom-up strategy, clusters are produced by repeatedly grouping
neighbouring clusters, forming a higher level of the hierarchy. However, in the
top-down strategy, each cluster may be decomposed into several subclusters,
forming a lower level of the hierarchy.
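
As an illustration of entropy-based discretization from the list above, the sketch below finds a single split-point that minimizes the class-weighted entropy of the two resulting intervals. The names, the sample ages and class labels are hypothetical, and using the midpoint between adjacent distinct values as the candidate split is an assumption of this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split, weighted_entropy) with the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (split, weighted)
    return best


# Hypothetical ages with a class label (buys the product or not).
ages = [20, 25, 30, 50, 55, 60]
buys = ["no", "no", "no", "yes", "yes", "yes"]
split, weighted_entropy = best_split_point(ages, buys)
```

Here the split at 40 separates the classes perfectly, so the weighted entropy is zero. Applying the same function recursively to each resulting interval would build the hierarchical discretization.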


Q.48. What do you understand by categorical data ? How can a concept
hierarchy be generated for categorical data ?
Or
Explain about concept hierarchy generation for categorical data.
(R.G.P.V., June 2011)

Ans. Categorical data are discrete data. Categorical attributes contain a
limited number of distinct values, but there is no ordering among them.
Examples are item type, job category, and geographic location. Some methods for
the generation of concept hierarchies for categorical data are as follows -


(i) Specification of a Partial Ordering of Attributes Explicitly at
the Schema Level by Users or Experts - Typically, concept hierarchies for
categorical attributes or dimensions involve a group of attributes. A concept
hierarchy can be easily defined by a user or expert by specifying a partial or
total ordering of the attributes at the schema level. As an example, a
relational database or a dimension location of a data warehouse has the
following attributes - street, city, province_or_state, and country. A hierarchy
can be defined by specifying the total ordering among these attributes at the
schema level, such as street < city < province_or_state < country.


(ii) Specification of a Portion of a Hierarchy by Explicit Data
Grouping - It is the manual definition of a portion of a concept hierarchy. It
is unrealistic in a large database to define a complete concept hierarchy by
explicit value enumeration. On the other hand, explicit grouping for a small
portion of intermediate-level data can be easily specified. For instance, after
specifying that province and country form a hierarchy at the schema level, a
user could define some intermediate levels manually, like "{Alberta,
Saskatchewan, Manitoba} ⊂ Prairies_Canada" and "{British Columbia,
Prairies_Canada} ⊂ Western_Canada".

(iii) Specification of a Set of Attributes, but not of their Partial
Ordering - A set of attributes forming a concept hierarchy may be specified
by a user, but without explicitly stating their partial ordering. Then the
system can try to automatically create the attribute ordering to form a
meaningful concept hierarchy. A concept hierarchy can be automatically created
on the basis of the number of different values per attribute in the given set of
attributes. The attribute that contains the maximum number of different values
is kept at the lowest level of the hierarchy, and the attribute that contains
the minimum number of different values is at the highest level of the concept
hierarchy. In most cases, this heuristic rule works well. After examination of
the generated hierarchy, some local-level swapping or adjustments may be applied
by users or experts.
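
The distinct-value heuristic from method (iii) above can be sketched as follows. The location records and the function name are hypothetical; the function only orders the attributes and leaves any manual adjustment to the user.

```python
def order_attributes_by_distinct_values(rows, attributes):
    """Order attributes from lowest to highest hierarchy level.

    The attribute with the most distinct values goes to the lowest
    level; the one with the fewest goes to the top.
    """
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    ordering = sorted(attributes, key=lambda a: counts[a], reverse=True)
    return ordering, counts


# Hypothetical location records.
rows = [
    {"street": "1 Main St", "city": "Vancouver",
     "province_or_state": "British Columbia", "country": "Canada"},
    {"street": "2 Oak Ave", "city": "Vancouver",
     "province_or_state": "British Columbia", "country": "Canada"},
    {"street": "3 Pine Rd", "city": "Victoria",
     "province_or_state": "British Columbia", "country": "Canada"},
    {"street": "4 Elm St", "city": "Calgary",
     "province_or_state": "Alberta", "country": "Canada"},
    {"street": "5 Lake Dr", "city": "Edmonton",
     "province_or_state": "Alberta", "country": "Canada"},
]

ordering, counts = order_attributes_by_distinct_values(
    rows, ["country", "city", "street", "province_or_state"]
)
print(" < ".join(ordering))  # street < city < province_or_state < country
```

The heuristic can fail when a higher-level attribute happens to have more distinct values than a lower-level one, which is why a review by users or experts is still useful.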


Q.49. What are the various steps involved in design and construction of
data warehouse ? (R.G.P.V., Dec. 2003)
Or
Explain the design and construction of a data warehouse.
(R.G.P.V., May 2019)

Ans. There are three approaches to build a data warehouse - a top-down
approach, a bottom-up approach, or a combination of both. The various steps
involved in the design and construction of a data warehouse are planning,
requirements study, problem analysis, warehouse design, data integration and
testing, and finally deployment of the data warehouse. Two methodologies are
commonly used for developing large software systems, namely, the waterfall
method and the spiral method. The waterfall method performs a structured and
systematic analysis at each step before going to the next, falling from one
step to the next. The spiral method involves the rapid generation of
increasingly functional systems, with short intervals between successive
releases. The spiral model is a better choice for data warehouse development,
especially for data marts, since the turnaround time is less, modifications can
be performed easily, and new designs and technologies can be adapted in a timely
manner.

Following steps are involved in the warehouse design process -

(i) Select a business process to model. For example, sales, shipments,
invoices, inventory, account administration, orders, or the general ledger. A
data warehouse model should be followed if the business process is
organizational and involves multiple complex object collections. However, a data
mart model should be selected if the process is departmental and emphasizes the
analysis of one kind of business process.

(ii) Select the grain of the business process. The grain is the
fundamental, atomic level of data to be represented in the fact table for this
process.

(iii) Select the dimensions that will be used for each fact table record.
Customer, supplier, status, time, item and warehouse are typical dimensions.

(iv) Select the measures that will populate each fact table record.
Typical measures are numeric additive quantities such as dollars sold.

The data warehouse implementation scope should be clearly defined
because its construction is a difficult and long-term task. Once a data
warehouse is designed and constructed, the initial deployment of the warehouse
involves initial installation, roll-out planning, training, and orientation.


Q.50. Explain in brief about conceptual modeling.

Ans. Multidimensional modeling and OLAP workloads require specialized
design techniques. In the context of design, a basic role is played by conceptual
modeling that provides a higher level of abstraction in describing the
warehousing process and architecture in all its aspects, aimed at achieving
independence of implementation issues. Conceptual modeling is widely
recognized to be the necessary foundation for building a database that is
well-documented and fully satisfies the user requirements; usually, it relies on a
graphical notation that facilitates writing, understanding, and managing
conceptual schemata by both designers and users.

Unfortunately, in the field of data warehousing there still is no consensus 
about a formalism for conceptual modeling. The entity/relationship (E/R) model
is widespread in the enterprises as a conceptual formalism to provide standard 
documentation for relational information systems, and a great deal of effort 
has been made to use E/R schemata as the input for designing nonrelational 
databases as well; nevertheless, as E/R is oriented to support queries that 
navigate associations between data rather than synthesize them, it is not well 
suited for data warehousing. Actually, the E/R model has enough expressivity 
to represent most concepts necessary for modeling a DW; on the other hand, 
in its basic form, it is not able to properly emphasize the key aspects of the 
multidimensional model, so that its usage for DWs is expensive from the point 
of view of the graphical notation and not intuitive. 


Some designers claim to use star schemata for conceptual modeling. A 
star schema is the standard implementation of the multidimensional model on 
relational platforms; it is just a (denormalized) relational schema, so it merely 
defines a set of relations and integrity constraints. Using the star schema for
conceptual modeling is like starting to build a complex software by writing the
code, without the support of any static, functional, or dynamic model, which
typically leads to very poor results from the points of view of adherence to
user requirements, of maintenance, and of reuse.

Q.51. Explain star schema with example.
Or
Explain the use of facts, dimensions and attributes in the star schema.
(R.G.P.V., Dec. 20)

Ans. Star schema modeling is the most common modeling paradigm. In star
schema, the data warehouse has a large central table known as fact table
containing the bulk of the data with no redundancy, and a set of smaller
attendant tables known as dimension tables, one for each dimension. This schema
graph looks like a star in shape, with the dimension tables displayed in a
radial pattern around the central fact table.

For example, consider a star schema of sales with four dimensions: time,
item, branch, and location. This schema also has a central fact table that
contains keys to each of the four dimensions, along with two measures:
dollars_sold and units_sold. Fig. 1.15 illustrates a star schema for sales.
[Figure: central sales fact table joined to the time, item, branch and
location dimension tables]

Fig. 1.15 Star Schema for Sales 


In star schema, each dimension is shown by a single table, and each table
keeps a set of attributes. For example, the location dimension table contains
the attribute set {location_key, street, city, province_or_state, country}. This
constraint may introduce some redundancy. For example, Gwalior and Indore
are both cities in M.P. state of India. These types of entries in the location
dimension table will create redundancy among the attributes province_or_state
and country, i.e., (..., Gwalior, M.P., India) and (..., Indore, M.P., India). In a
dimension table, the attributes may form either a hierarchy or a lattice.
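
As a rough illustration of the star schema described above, the following sketch builds a reduced version of the sales schema with Python's built-in sqlite3 module. Only two of the four dimensions (item and location) are modelled to keep it short, and the sample rows, street names and brand are invented for the example. It is one possible layout under these assumptions, not a prescribed implementation.

```python
import sqlite3

# In-memory database; table and column names follow the sales example,
# but the sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT
);
-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold INTEGER
);
""")

# Note the redundancy: 'M.P.' and 'India' are repeated for each city.
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "Street A", "Gwalior", "M.P.", "India"),
    (2, "Street B", "Indore", "M.P.", "India"),
])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)", [
    (10, "Pen", "BrandX", "stationery", "wholesale"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (10, 1, 100.0, 20),
    (10, 2, 250.0, 50),
])

# A typical star join: one join per dimension used, then aggregate.
state_totals = conn.execute("""
    SELECT l.province_or_state, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.province_or_state
""").fetchall()
print(state_totals)
```

Each query needs only one join per dimension it touches, which is the main practical appeal of the star layout.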


Q.52. Explain all steps in designing star schema. (R.G.P.V., June 2013)

Ans. The steps used in designing star schema are as follows -
(i) Identify a business process for analysis
(ii) Identify measures or facts
(iii) Identify dimensions for facts
(iv) Use the columns that describe each dimension
(v) Determine the lowest level of summary in a fact table


Q.53. Discuss the most common performance improvement techniques
used in star schemas. (R.G.P.V., Dec. 2011)

Ans. Generally, four techniques are used to optimize data warehouse
design -

(i) Normalizing Dimensional Tables - The normalization is
performed to dimension tables to obtain semantic simplicity and facilitate
end-user navigation through the dimensions. In other words, you simplify the
data-filtering operations related to the dimensions by normalizing the
dimension tables.

(ii) Maintaining Multiple Fact Tables Representing Different
Aggregation Levels - The query operations can be speeded up by creating and
maintaining multiple fact tables related to each level of aggregation. These
aggregate tables are precomputed at the data loading phase instead of run time.
The goal of this technique is to save processor cycles at run time, hence
speeding up data analysis. An end-user query tool optimized for decision
analysis then properly accesses the summarized fact tables instead of computing
the values by accessing a lower level of detail fact table.


(iii) Denormalizing Fact Tables — The access performance and storage 
space of data are improved by denormalizing fact tables. However, storage
space is becoming less of an issue. The costs of data storage decrease almost
daily, and DBMS limitations that restrict database and table size limits, record
size limits, and the maximum number of records in a single table have far more
negative effects than raw storage space costs. Denormalization improves performance
by using a single record to store data that normally takes many records.


(iv) Partitioning and Replicating Tables — Partitioning divides a 
table into subsets of rows or columns and places the subsets close to the 
client computer to improve data access time. Replication makes a copy of the 
table and places it in a different location, also to improve access time. 
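
The second technique above (aggregate fact tables precomputed at load time) can be sketched as follows. This is a minimal illustration using Python's sqlite3 module; the table names sales_fact_daily and sales_fact_monthly and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Detailed fact table at daily grain (hypothetical sample data).
conn.execute(
    "CREATE TABLE sales_fact_daily (sale_date TEXT, item_key INTEGER, dollars_sold REAL)"
)
conn.executemany("INSERT INTO sales_fact_daily VALUES (?, ?, ?)", [
    ("2024-01-05", 1, 100.0),
    ("2024-01-20", 1, 50.0),
    ("2024-02-03", 1, 70.0),
])

# Load phase: precompute the monthly aggregate once, so that queries
# at month level do not have to scan the daily rows at run time.
conn.execute("""
    CREATE TABLE sales_fact_monthly AS
    SELECT substr(sale_date, 1, 7) AS sale_month,
           item_key,
           SUM(dollars_sold) AS dollars_sold
    FROM sales_fact_daily
    GROUP BY substr(sale_date, 1, 7), item_key
""")

# Query phase: read the summarized table directly.
monthly = conn.execute(
    "SELECT sale_month, item_key, dollars_sold "
    "FROM sales_fact_monthly ORDER BY sale_month"
).fetchall()
print(monthly)
```

The trade-off is extra storage and extra work during each load in exchange for cheaper queries at the aggregated level.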


Q.54. Explain snowflake schema. How does it differ from star schema ?
Or
Briefly describe the similarities and differences between star and
snowflake schemas. (R.G.P.V., June 2017)

Ans. Snowflake schema slightly differs from star schema in that
some dimension tables are normalized, hence further splitting the data into
additional tables. The snowflake schema graph resembles a snowflake in
shape. The main difference between the snowflake and star schema models is
that, in snowflake model, the dimension tables may be kept in
normalized form to remove redundancies. Such a table saves storage space and
is easy to maintain, although this saving is not considerable as compared to
the size of the fact table. Moreover, the snowflake structure can reduce the
effectiveness of browsing, because more joins will be required to
run a query. As a result, the system performance may be degraded. Because
of this, the snowflake schema is not as popular as the star schema in
data warehouse design.


[Figure: sales fact table with time, item, branch and location dimension
tables; item links to a supplier dimension table (supplier_key, supplier_type)
and location links to a city dimension table]

Fig. 1.16 Snowflake Schema for Sales 
For example, consider a snowflake schema for sales as shown in
fig. 1.16. In snowflake model, the sales fact table is similar to the star schema.
The two schemas mainly differ in the definition of dimension tables. The
normalization of single dimension table for item in the star schema is performed
in the snowflake schema, providing two new tables: item and supplier.
Now, in snowflake schema, the item dimension table contains the attributes
item_key, item_name, brand, type and supplier_key, in which supplier_key is
connected to the supplier dimension table. This table contains supplier_key
and supplier_type information. In the same way, normalization of single
dimension table for location in the star schema results in two new tables:
location and city. The city_key in the new location table connects to the city
dimension table.


Q.55. Differentiate between star schema and snowflake schema with
the help of an example. (R.G.P.V., Dec. 2004, June 2009, 2014)
Or
Compare the star schema and snowflake schema.
(R.G.P.V., Nov./Dec. 2007)
Or
Describe star schema and snowflake schema with examples.
(R.G.P.V., June 2019)

Ans. Refer to Q.51 and Q.54.

Q.56. Explain galaxy schema in detail.
Or
Write short note on fact constellation. (R.G.P.V., Dec. 2004)


Ans. Multiple fact tables are required by sophisticated applications to 
share dimension tables. This type of schema can be seen as a group of stars
and hence is known as a galaxy schema or a fact constellation. 


[Figure: sales and shipping fact tables sharing the time, item and location
dimension tables, together with the branch and shipper dimension tables]

Fig. 1.17 Fact Constellation Schema for Sales and Shipping 


For example, consider a galaxy schema for sales. This schema defines
two fact tables, sales and shipping. In galaxy schema, the sales fact table
definition is similar to the star schema. The shipping fact table contains five
dimensions, or keys: item_key, time_key, shipper_key, from_location, and
to_location, and two measures: dollars_cost and units_shipped. Dimension
tables can be shared between fact tables in fact constellation schema. For
example, the dimension tables for time, item and location are shared between
both the sales and shipping fact tables.

Q.57. Discuss star, snowflake and galaxy schema for multidimensional
database. (R.G.P.V., May 2019)
Or
Compare star, snowflake and fact constellation schemas.
(R.G.P.V., Dec. 2008)
Or
What is meant by data warehouse schemas ? Draw schematic diagrams.
(R.G.P.V., June 2014)
Or
Discuss the various schematic representations in multidimensional model.
(R.G.P.V., May 2019)

Ans. Refer to Q.51, Q.54 and Q.56.

Q.58. Explain snowflake and galaxy schemas for multidimensional
databases. (R.G.P.V., Dec. 2011)

Ans. Snowflake Schemas - Refer to Q.54.

Galaxy Schemas - Refer to Q.56.

Q.59. Discuss the various ways of horizontal partitioning.

Ans. The various ways in which fact data could be partitioned, before
deciding on the optimum solution are as follows -

(i) Partitioning by Time into Equal Segments - The fact table is
partitioned on a time period basis, where each time period represents a
significant retention period within the business. For example, if the majority
of the user queries are likely to be querying on month-to-date values, it is
probably appropriate to partition into monthly segments. If the query period
is fortnight-to-date, consider partitioning into fortnightly segments as long
as the total number of tables does not exceed something in the order of 500.
Table partitions can be reused, by removing all the data in them. However,
we have to take into account that a number of the partitions will store
transactions over a busy period in the business, and that the rest may be
substantially smaller.
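
To make time-based partitioning concrete, here is a small pure-Python sketch. The partitions are modelled as lists in a dictionary rather than as real database tables, and the naming pattern sales_YYYY_MM and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date


def partition_name(sale_date: date) -> str:
    """Map a transaction date to its monthly partition (hypothetical naming)."""
    return f"sales_{sale_date.year}_{sale_date.month:02d}"


# Each partition stands in for one physical fact table segment.
partitions = defaultdict(list)

rows = [
    (date(2024, 1, 5), 100.0),
    (date(2024, 1, 20), 50.0),
    (date(2024, 2, 3), 70.0),
]
for sale_date, dollars_sold in rows:
    partitions[partition_name(sale_date)].append((sale_date, dollars_sold))


def month_to_date_total(today: date) -> float:
    """Answer a month-to-date query by scanning only the current partition."""
    current = partitions.get(partition_name(today), [])
    return sum(dollars for sold_on, dollars in current if sold_on <= today)


jan_total = month_to_date_total(date(2024, 1, 31))
print(sorted(partitions), jan_total)
```

Because the query period matches the partition period, a month-to-date query never touches rows from other months.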

(ii) Partitioning by Time into Different-sized Segments — When 
aged data is accessed infrequently, it may be appropriate to partition the fact 
table into different-sized segments. Typically, this would be implemented asa 
set of small partitions for relatively current data, larger partitions for less 
active data and even larger partitions for inactive data. The advantages of 
using this technique are that all the detailed information remains available online,
without having to resort to using aggregations. Also, the number of physical
tables is kept relatively small, reducing operating costs. The disadvantage of
using this technique is that the partitioning profile changes on a regular
basis. Using this method, a partitioning strategy that works on a monthly
basis implies that data must be physically repartitioned at the start of the new
month or at the very least at the start of each quarter.
(iii) Partitioning on a Different Dimension - Time-based
partitioning is the safest basis for partitioning fact tables. This does not
mean that fact tables cannot be partitioned on a different basis. There may be
good reasons for partitioning by product group, region, supplier or any other
dimension.

For example, let us consider a marketing function that is structured into
distinct regional departments, for example, on a state-by-state basis. If each
region tends to query on information captured within its region, it is probably
more effective to partition the fact table into regional partitions. This
guarantees that all the queries for that region are speeded up by not having to
scan information that is not relevant.

Clearly, the benefit of this style of partitioning is that it speeds up all
queries regarding a region, regardless of the time period it covers. This
technique is particularly appropriate where there is no definable active period
within the organization.

(iv) Partitioning by Size of Table - In some data warehouses, there
may not be a clear basis for partitioning the fact table on any dimension. In
these instances, you should consider partitioning the fact table purely on a size
basis, that is, when the table is about to exceed a predetermined size, a new
table partition is created. This partitioning scheme is complex to manage and
requires metadata to identify what data is stored in each partition.

If we consider a customer event data warehouse in the retail banking
area, we could find that the business operates a 7 x 24 operation, that is, there
is no operational concept of the end of the business day, because transactions
can occur at any point in time. It may be inappropriate to split customer
transactions on a daily/weekly/monthly basis.


(v) Partitioning Dimensions - In some cases, the dimension may
contain such a large number of entries that it may need to be partitioned in the
same way as a fact table. Put another way, we need to check the size of the
dimension over the lifetime of the data warehouse.

For example, let us consider a large dimension that varies over time. If the
requirement exists to store all the variations in order to apply comparisons,
that dimension may become quite large. A large dimension table can substantially
affect query response times.

(vi) Partitioning using Round-robin Partitions - Once the data
warehouse is holding its full complement of information, as a new
partition is required, the oldest partition will be archived. It is possible to
archive the oldest partition prior to creating a new one and reuse the old
partition for the latest data. Metadata is used in order to allow user access
tools to refer to the correct table partition. The warehouse manager creates a
meaningful table name, such as sales_month_to_date or sales_last_week, which
represents the data content of a physical partition. This technique also makes
it simpler to automate the table management facilities within the data
warehouse, because they continue to refer to the same physical table partitions.
The time period they cover will change, but this can be managed by
using appropriate metadata.

Q.60. Discuss vertical partitioning in detail.

Ans. In vertical partitioning, data is split vertically. This process is shown
in fig. 1.18. This process can take two forms: normalization and row splitting.
Normalization is a standard relational method of database organization. It allows
common fields to be collapsed into single rows, thereby reducing space usage.
For example, tables 1.1 and 1.2 show a normalization process. In the data
warehouse arena the approach tends to be the other way. Large tables are often
denormalized, even though this can lead to a lot of extra space being used, to
avoid the overheads of joining the data during queries. This is particularly true
of the fact data.


Fig. 1.18 Vertical Partitioning

Table 1.1 Tables Before Normalization

Table 1.2 Tables After Normalization

Vertical partitioning can sometimes be used in a data warehouse to split 
less used column information out from a frequently accessed fact table. We 
distinguish row splitting from the normalization process, because it is performed 
for a different purpose. Row splitting also tends to leave a one-to-one map 
between the partitions, whereas normalization will leave a one-to-many mapping. 
The aim of row splitting is to speed access to the large table by reducing its 
size. The other data is still maintained and can be accessed separately. Before 
using a vertical partition you need to be very sure that there will be no 
requirements to perform major join operations between the two partitions. 
This sort of partitioning can be useful, for example, in situations where the 
split-out data is accessed only by drill-down operations. 


Q.61. Write short note on data warehouse implementation.

Ans. Data warehouses contain huge volumes of data. OLAP servers demand
that decision support queries be answered in the order of seconds. Therefore,
it is crucial for data warehouse systems to support highly efficient cube
computation techniques, access methods and query processing techniques.
The efficient implementation of a data warehouse system supports efficient
computation of data cubes, indexing OLAP data and processing of OLAP
queries.

Q.62. Discuss the cube computation technique for data warehouse.
(R.G.P.V., Dec. 2003)
Or
Discuss the methods for efficient computation of data cubes.
(R.G.P.V., June 2016)

Ans. The efficient computation of aggregations across many sets of
dimensions is the core of multidimensional data analysis. These aggregations
are known as group-by's in SQL. Each group-by can be shown by a cuboid,
where the set of group-by's creates a lattice of cuboids resulting in a data cube.
One approach to cube computation extends SQL so as to incorporate a compute
cube operator, which computes aggregates over all subsets of the dimensions
specified in the operation. For large numbers of dimensions, it requires more
storage space.

Suppose we use the dimensions item, city, year, and sales_in_dollars to create
a data cube for sales data. Taking three attributes, item, city and year as the
dimensions and sales_in_dollars as the measure, the total number of cuboids
or group-by's is 8 ($2^3$). The possible group-by's are ( ), (item), (city), (year),
(item, city), (item, year), (city, year) and (item, city, year).
These group-by's form a lattice of cuboids for the data cube as
shown in fig. 1.19. The 0-D or apex cuboid is the case where the group-by
set is empty. The 3-D or base cuboid contains all three
dimensions. It can return the total sales for any combination of the
three dimensions. The apex cuboid is the most generalized
(least specific) and the base cuboid is the least generalized
(most specific) of the cuboids.

[Figure: lattice of cuboids from the 0-D (apex) cuboid ( ) to the 3-D (base)
cuboid (item, city, year)]
Fig. 1.19

An SQL query having no group-by, for example “compute the sum of
sales”, is a zero-dimensional operation. An SQL query having one dimension,
for example “compute the sum of sales, group-by year”, is a one-dimensional
operation. On n dimensions, a cube operator is equivalent to a collection of
group-by statements, one for each subset of the n dimensions. Thus, the
n-dimensional generalization of the group-by operator is the cube operator. OLAP
may need to access different cuboids for different queries. Hence, it is good
to compute all or at least some of the cuboids in advance. This computation
results in fast response time and avoids some redundant computation. However,
in this precomputation, the required storage space may explode when all the
cuboids in a data cube are precomputed. The storage requirements are even
more excessive when many of the dimensions have associated concept
hierarchies, each with multiple levels. This problem is known as the curse of
dimensionality.

The total number of cuboids for an n-dimensional data cube is $2^n$, if there
are no hierarchies associated with each dimension. But, practically many
dimensions do have hierarchies. As an example, the dimension time is not
explored at only one conceptual level, such as year, but at multiple conceptual
levels, such as in the hierarchy “day < month < quarter < year”. For an
n-dimensional data cube, the total number of cuboids that can be generated
(including the cuboids generated by climbing up the hierarchies along each
dimension) is

Total Number of Cuboids = $\prod_{i=1}^{n} (L_i + 1)$

where $L_i$ denotes the number of levels associated with dimension i and one is
added to $L_i$ in above equation to include the virtual top level, all.

Q.63. What do you understand by partial materialization ? What is the
significance of partial materialization as compared to full materialization 
of the data cube ? (R.GPV., June 2009) 

Ans. Partial materialization is the selective computation of a proper subset
of the whole set of possible cuboids. Alternatively, we may compute a subset
of the cube, that stores only those cells that meet some user-specified criterion.
To refer to the latter case, we use the term subcube in which only some of the
cells may be precomputed for different cuboids. Partial materialization shows
an interesting trade-off between storage space and response time. Following
factors should be considered by the partial materialization of cuboids or 
subcubes — 

(i) Specify the subset of cuboids or subcubes to materialize. 

(ii) During query processing, exploit the materialized cuboids or 
subcubes. 

(iii) During load and refresh, efficiently update the materialized 
cuboids or subcubes. 

The selection of the subset of cuboids or subcubes to materialize should 
consider the queries in the workload, their accessing cost, and their frequencies. 
Besides, it should consider workload characteristics, the total storage 
requirements, the cost for incremental updates, and the broad context of 
physical database design. For cuboid and subcube selection, several OLAP 
products have used heuristic approaches. A popular one is to materialize the 
set of cuboids on which other frequently referenced cuboids are based.
Alternatively, we can compute an iceberg cube. This is a data cube that records
only those cube cells whose aggregate value is above some minimum support
threshold. Another common approach is to materialize a shell cube that
precomputes the cuboids for only a small number of dimensions.

After materializing the selected cuboids, it is important to take advantage
of them during query processing. This involves various issues, like how to
use available index structures on the materialized cuboids, how to determine
the relevant cuboids from among the candidate materialized cuboids, and how
to transform the OLAP operations onto the selected cuboids. At last, the
materialized cuboids should be updated efficiently during load and refresh.

The significance of partial materialization as compared to the full
materialization of the data cube is that full materialization precomputes all the
cuboids in a data cube. This choice typically requires a huge
amount of storage space to store all of the precomputed cuboids,
especially when the number of dimensions and size of associated concept
hierarchies are large. This problem is referred to as the curse of dimensionality.
Partial materialization selectively computes a subset of the cuboids
in the lattice that requires less amount of storage space.
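
The iceberg cube idea mentioned above can be sketched naively in Python. This version computes every cell of every cuboid and then filters, which is only meant to show what is kept; practical algorithms try to avoid computing the discarded cells at all. The function name iceberg_cube, the sample rows, and the choice of "at or above the threshold" are assumptions made for the illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows using the dimensions from the cube example.
sales = [
    {"item": "pen", "city": "Gwalior", "year": 2023, "sales_in_dollars": 100},
    {"item": "pen", "city": "Indore", "year": 2023, "sales_in_dollars": 40},
    {"item": "book", "city": "Indore", "year": 2024, "sales_in_dollars": 30},
]
dimensions = ("item", "city", "year")


def iceberg_cube(rows, dims, measure, min_total):
    """Keep only the cells whose aggregated measure reaches min_total."""
    cells = defaultdict(int)
    for k in range(len(dims) + 1):
        for cuboid in combinations(dims, k):
            for row in rows:
                # A cell is identified by its cuboid and its dimension values.
                key = (cuboid, tuple(row[d] for d in cuboid))
                cells[key] += row[measure]
    return {key: total for key, total in cells.items() if total >= min_total}


cube = iceberg_cube(sales, dimensions, "sales_in_dollars", min_total=100)
for (cuboid, values), total in cube.items():
    print(cuboid, values, total)
```

For these three rows the full cube has 18 cells, while the filtered result keeps far fewer, which illustrates the storage saving that motivates partial materialization.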
Q.64. Why do most data warehouse systems support index structures ? Discuss
methods to index OLAP data. (R.G.P.V., June 2015)

Ans. Most data warehouse systems support index structures to facilitate 
efficient data accessing. 

Bitmap Indexing — This method is most popular in OLAP products due 
to fast searching in data cubes. It is an alternative representation of the 
record ID (RID) list. In this method for a given attribute, there is a distinct bit
vector, Bv, for each value v in the domain of the attribute. Total n bits are
required for each entry in the bitmap index when the domain of a given attribute
consists of n values. If the attribute has the value v for a given row in the data
table, then the bit representing that value is set to 1 in the corresponding row
of the bitmap index. All other bits for that row are set to 0.


The bitmap indexing is useful for low-cardinality domains since join, 
comparison and aggregation operations are reduced to bit arithmetic, that 
significantly reduces the processing time. For higher-cardinality domains, the 
method can be adapted employing compression techniques. Bitmap indexing 
results in significant reduction in input/output and space because a string of
characters is represented by a single bit. This method is advantageous compared
to tree and hash indices. 
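
A minimal Python sketch of bitmap indexing follows. Each bit vector is stored as a Python integer whose bit i corresponds to row i; the helper names build_bitmap_index and rows_from_bits and the sample columns are assumptions for the example, not part of any particular OLAP product.

```python
def build_bitmap_index(values):
    """Return {value: bit vector}, where bit i is 1 if row i holds that value."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_from_bits(bits):
    """Decode a bit vector back into the list of matching row ids."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Two low-cardinality columns of a small hypothetical table.
city = ["Gwalior", "Indore", "Gwalior", "Bhopal"]
item = ["pen", "pen", "book", "pen"]

city_index = build_bitmap_index(city)
item_index = build_bitmap_index(item)

# The condition city = 'Gwalior' AND item = 'pen' becomes one bitwise AND.
matches = rows_from_bits(city_index["Gwalior"] & item_index["pen"])
print(matches)
```

The selection is answered by bit arithmetic on the two vectors, without comparing the string values row by row.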




Join Indexing - This method is popular from its use in relational database
query processing. In traditional indexing, the value in a given column is mapped
to a list of rows having that value. However, join indexing registers the joinable
rows of two relations from a relational database. For instance, consider two
relations M(MID, X) and N(NID, Y) that join on the attributes X and Y, then the
join index record contains the pair (MID, NID), where MID and NID are record
identifiers from the M and N relations, respectively. Hence, the join index
records can recognize joinable tuples without performing costly join operations.

The join indexing is useful for maintaining the relationship between a
primary key and foreign keys, from the joinable relation. It maintains
relationships between attribute values of a dimension and the corresponding
rows in the fact table. Join indices may span multiple dimensions to form
composite join indices. We can use join indices to identify subcubes that are of
interest. The star schema makes join indexing attractive for cross-table search,
because the linkage between a fact table and its
corresponding dimension tables comprises the foreign key of the fact table
and the primary key of the dimension table.
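
As an illustration of a join index between a dimension table and a fact table, the sketch below maps each city value of a location dimension to the row ids of the matching fact rows. The data structures, the function name build_join_index and the sample rows are assumptions chosen for brevity.

```python
# Dimension table: location_key -> city (hypothetical sample data).
location_dim = {1: "Gwalior", 2: "Indore"}

# Fact table: the row id is the position in the list.
sales_fact = [
    {"location_key": 1, "dollars_sold": 100.0},
    {"location_key": 2, "dollars_sold": 250.0},
    {"location_key": 1, "dollars_sold": 60.0},
]


def build_join_index(dim, fact, key):
    """Map each dimension attribute value to the ids of joinable fact rows."""
    index = {value: [] for value in dim.values()}
    for row_id, row in enumerate(fact):
        index[dim[row[key]]].append(row_id)
    return index


join_index = build_join_index(location_dim, sales_fact, "location_key")

# Rows for 'Gwalior' are found through the index, with no join at query time.
gwalior_total = sum(sales_fact[r]["dollars_sold"] for r in join_index["Gwalior"])
print(join_index, gwalior_total)
```

The join is effectively performed once, when the index is built, and reused by every query that filters on the dimension attribute.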


Q.65. What is data mart ? What are the types of data mart ?
(R.G.P.V., June 2016)
Or
Write short note on data mart. (R.G.P.V., June 2017)

Ans. Data Mart - Refer to Q.15 (i).

Types of Data Mart - Data marts can be of two types, namely, independent
and dependent, on the basis of source of data. The independent data mart is
built directly from the legacy applications. An independent data mart is shown
in fig. 1.20. Independent data marts are appealing because they appear to be a
direct solution to the information problem. An independent data mart
may be built by one department with no consideration to other departments
or to a centralized IT organization. When creating
an independent data mart, there is no need to “think globally”. A subset of the
whole DSS requirements for an organization is represented by the independent
data mart. Creating an independent data mart is relatively inexpensive. An
independent data mart permits an organization to take its information
destiny in its own hands. These are just a few of the reasons why independent
data marts are popular. Looking back at the diagram it can be seen that
multidimensional technology strongly suggests that independent data marts be
built.

Fig. 1.20 Independent Data Marts Representation


The architectural counterpoint to the independent data mart is the
dependent data mart. A dependent data mart is shown in fig. 1.21.

[Figure: operational/legacy systems feed the data warehouse, which feeds the
data marts]
Fig. 1.21 Dependent Data Marts

The opposite of an independent data mart is a dependent data mart. A
dependent data mart is one which is created from data coming from the data
warehouse. The dependent data mart does not depend on legacy or operational
data for its source. For source data, the dependent data mart depends only on
the data warehouse. Forethought and investment are required by the dependent
data mart. Someone to “think globally” is required by the dependent data mart.
The dependent data mart needs many users to pool their information requirements
for the creation of the data warehouse. In other words, the dependent data mart
needs advance planning, a long-term perspective, global analysis, and
cooperation and coordination of the definition of requirements among different
departments of an organization.
Q.66. Differentiate between a data warehouse and a data mart. What is
the role of data mart in data warehousing ? (R.G.P.V., Dec. 2003)


Ans. Differentiation between Data Warehouse and Data Mart — A 
data warehouse gathers information about subjects that span the complete 
organization like customers, sales, items, assets and personnel. On the other 
hand, a data mart is a subset of the data warehouse that emphasizes on chosen 
subjects. The scope of data warehouse is enterprise-wide, while the scope of 
data mart is department-wide. Usually, the fact constellation schema is used 
for data warehouse because it can model multiple, interrelated subjects. In 
contrast, the star or snowflake schema is used for data marts because both 

are geared towards modeling single subjects. 

Data marts enable us to construct entire model by physically separating 
data segment within the data warehouse. In order to avoid possible primary
difficulties from the data warehouse, the detailed data can then be eliminated. 


Role of Data Mart in Data Warehousing - The current trend is to
define the data warehouse as a conceptual environment. The industry is moving
from a single, physical data warehouse toward a set of smaller, more
manageable, databases known as data marts. Partitions of the overall data
warehouse are also known as data marts. If we visualize the data warehouse
as covering every aspect of a company’s business (sales and so forth), then a
data mart is a subset of that large data warehouse built specifically for a
department. Data marts may contain some overlapping data. The physical data
marts together serve as the conceptual data warehouse. Data marts must provide
the easiest possible access to information required by its user community.


Q.67. Explain in brief distributed data marts.

Ans. A distributed data mart is a conglomeration of separate
components which are connected through a network. The aim is to make
these separate components appear as a single global data warehouse
image. A distributed data warehouse, the nucleus of all enterprise data,
sends relevant data to individual data marts from which users can access
information for order management, customer billing, sales analysis, and
other reporting and analytic functions.

In the modern era of economic change, market techniques and information
dimensions change both globally and locally. Every company keeps information
about its business on the internet, and this helps it to take decisions
simply and at the right time. Each company stores its information in a
database, so it needs a data warehouse. Organizations spend large sums of
money on this technology so that they can lead in a competitive
environment. A data warehouse is a set of materialized views over data
sources. Ralph Kimball et al. defined it as follows: “A data warehouse is a
copy of transaction data specifically structured for query and analysis”.
Because of market competitiveness, however, enterprises have to think on a
larger platform and then act on it. This requires changes in the data
warehouse, and the distributed data warehouse fulfils these requirements.


[Figure labels: Multitier Data Warehouse; Distributed Data Marts; Data Warehouse; "Define a High-level Corporate Data Model"]
Fig. 1.22 Representation of Distributed Data Marts


Q.68. What are the features of distributed data warehouses/marts ?

Ans. There are several features of distributed data warehouses/marts as
follows -

(i) The data copied into a data warehouse does not change. The
data warehouse is a historical record of the state of an organization. The
frequent changes of the source OLTP systems are reflected in the
warehouse by adding new data, not by changing existing data.

(ii) Data warehouses are subject oriented, that is, they focus on
measuring entities like sales, inventory and quality. OLTP systems, by
contrast, are function oriented and focus on operations like order
fulfillment.

(iii) In data warehouses, data from distinct function-oriented systems
is integrated to give a single view of an operational entity.

(iv) Data warehouses are designed for business users, not database
programmers, so they are easy to understand and query.
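
Feature (i) above, where source changes are reflected by adding new data rather than changing existing data, can be illustrated with a small Python sketch. The class names, the `phone` item and the dates are hypothetical and chosen only for illustration; a real warehouse would hold this history in database tables.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class InventorySnapshot:
    """One immutable warehouse row: stock level of an item as of a load date."""
    item: str
    quantity: int
    loaded_on: date


class InventoryHistory:
    """Hypothetical warehouse table that only ever grows (append-only)."""

    def __init__(self) -> None:
        self._rows: List[InventorySnapshot] = []

    def load(self, item: str, quantity: int, loaded_on: date) -> None:
        # A change in the OLTP source becomes a NEW row; old rows stay untouched.
        self._rows.append(InventorySnapshot(item, quantity, loaded_on))

    def as_of(self, item: str, day: date) -> Optional[int]:
        """Return the most recent quantity loaded on or before `day`."""
        matching = [r for r in self._rows if r.item == item and r.loaded_on <= day]
        if not matching:
            return None
        return max(matching, key=lambda r: r.loaded_on).quantity

    def __len__(self) -> int:
        return len(self._rows)


history = InventoryHistory()
history.load("phone", 40, date(2024, 1, 1))
history.load("phone", 25, date(2024, 2, 1))  # source value changed; history is kept
```

Because earlier rows are never overwritten, the state of the organization on any past date can still be queried with `as_of`.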


Q.69. What are the disadvantages of distributed data marts ?

Ans. There are considerable disadvantages involved in moving data from
multiple, often highly diverse, data sources to one data warehouse that
translate into long implementation time, high cost, lack of flexibility,
dated information and limited capabilities -

(i) Major data schema transforms from each of the data sources to
one schema in the data warehouse, which can represent more than 50% of the
total data warehouse effort.

(ii) Data owners lose control over their data, raising ownership,
security and privacy issues.

(iii) Long initial implementation time and associated high cost.

(iv) Adding new data sources takes time and has an associated high cost.

(v) Limited flexibility of use and types of users - requires many
separate data marts for many uses and types of users.

(vi) Usually, data is static and dated.

(vii) Usually, no data drill-down capabilities.

(viii) Difficult to accommodate changes in data types and ranges,
data source schema, indexes and queries.

(ix) Usually, cannot actively monitor changes in data.

Q.70. Explain framework for distributed data warehouses.

Ans. Two methods to create the distributed data warehouses are available,
which are described as follows -
(i) Inmon’s method (ii) White’s method.

(i) Inmon’s Method - This method considers the existence of
both local and global data warehouses, with the data stored in each being
mutually exclusive, as shown in fig. 1.23. The local data warehouse contains
the data of interest to the local site and involves historical data in
addition to local decision making functions. The global data warehouse
contains data common across the corporation and data integrated from the
various local staging areas for inclusion into the central position. This is
accomplished by staging local data warehouse data before passing it to the
global data warehouse, which gives the global decision support functionality
for corporate-wide queries. This method considers that data found in the
local data warehouse are not stored in the global data warehouse and vice
versa, thereby guaranteeing no redundancy between them. Inmon’s requirement
for exclusivity of data between the local and global data warehouses seems
to be impractical.

[Figure labels: Europe, Africa, Asia; Local Operational; Local Data Warehouse; Global Data Warehouse; Mutually Exclusive]
Fig. 1.23 Representation of Inmon’s Method

(ii) White’s Method - This is known as a “Two-tier data warehouse”,
which is the combination of both a centralized data warehouse and
decentralized data marts. White's central data warehouse contains normalized
detailed data captured and cleaned from operational systems at user-defined
intervals. The central data warehouse maintains data collections which consist
of data derived from the detailed base data. Data collections are the user view
of data warehouse data and may contain denormalized detailed data as well as
summarized data. A data distribution service is provided by the data warehouse
to distribute data collections to decentralized data marts at the multiple
branches or sites of the corporation. The data marts are subsequently
distributed to the other sites of the corporation. Data marts allow DSS
processing on local systems, which improves both performance and availability.


[Figure labels: Local Operational Processing (two sites); Centralized Data Warehouse; Data Mart (two sites)]
Fig. 1.24 Representation of White’s Method


Q.71. Explain in brief about distributed database designs.

Ans. In a distributed data environment, two approaches for distributed
database design were introduced - the top-down approach and the bottom-up
approach. The top-down approach is used if the databases do not yet exist.
Once the databases exist, the bottom-up design is the suitable approach.

(i) Top-down Design - In the top-down design approach, the data
warehouse is built first. The data marts are then created from the data
warehouse. This design is a process of creating data models which contain
high-level entities and relationships, to which successive refinements are
applied in order to identify the corresponding low-level entities,
relationships and attributes. The top-down approach is illustrated using the
concepts of the entity-relationship model. Its steps are -
(a) Analyzing requirements
(b) View integration and conceptual design
(c) Data distribution design
(d) Local physical schema design

(ii) Bottom-up Approach - Bottom-up design is appropriate when
the objective of the design is to integrate existing database systems. The
bottom-up design process begins with local conceptual schemas, and the
objective of the process is integrating local schemas into a global
conceptual schema. One of the most important aspects of the design is bringing
the multiple database systems together. Implementation alternatives are
classified according to the autonomy, distribution, and heterogeneity of the
local systems. Its steps are -
(a) Selecting a common database model for describing the global
schema of the existing databases.
(b) Translating each local schema into the common data model.
(c) Integrating local schemas from the existing databases into
the global conceptual schema.


META DATA, EXAMPLE OF A MULTIDIMENSIONAL DATA
MODEL, INTRODUCTION TO PATTERN WAREHOUSING


Q.72. Discuss the role of metadata in data warehouse.


(R.GP.V., Dec. 200 
Ans. Metadata play a very different role compared to other data
warehouse data. Metadata are important for a number of reasons. For example,
metadata are used as a guide to the mapping of data when the data are
transformed from the operational environment to the data warehouse
environment, as a directory to help the decision support system analyst to
locate the contents of the data warehouse, and as a guide to the algorithms
used for summarization between the current detailed data and the lightly
summarized data, and between the lightly summarized data and the highly
summarized data. Metadata should be stored and managed persistently.


Q.73. Explain metadata and its types. (R.G.P.V., Dec. 2018)
Or
Write short note on metadata repository. (R.G.P.V., June 2017)

Ans. Metadata are data about data. In a data warehouse, metadata are the
data that define warehouse objects. The metadata repository is a component
of the data warehouse architecture. Metadata is a bridge between the data
warehouse and the decision support application. Metadata are created for the
data names and definitions of the given warehouse. Metadata are required to
offer an unambiguous interpretation. Metadata give a catalogue of data in the
data warehouse and the pointers to this data. Besides, metadata may contain
data extraction/transformation history, data communication/modelling
algorithms, data warehouse table sizes, usage statistics and column aliases.
Metadata are also used to describe many aspects of the applications.

There are three broad categories of metadata, depending on how it is used -

(i) Build-time Metadata - Whenever we design and build a
warehouse, the metadata that we generate can be termed build-time metadata.
This metadata links business and warehouse terminology and describes the
technical structure of the data. It is the most detailed and exact type of
metadata and is used extensively by warehouse designers, developers and
administrators. It is the primary source of most of the metadata used in the
warehouse.

(ii) Usage Metadata - When the warehouse is in production, usage
metadata, which is derived from build-time metadata, is an important tool for
users and data administrators. This metadata is used differently from
build-time metadata and its structure must accommodate this fact.

(iii) Control Metadata - The third way metadata is used is
by the databases and other tools to manage their own operations. For example, a
DBMS builds an internal representation of the database catalogue for use as a
working copy from the build-time catalogue. This representation functions as
control metadata. Most control metadata is of interest only to systems
programmers. However, one subset, which is generated and used by the tools that
populate the warehouse, is of considerable interest to users and data warehouse
administrators. It provides vital information about the timeliness of warehouse
data and helps users track the sequence and timing of warehouse events.
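
The three categories can be pictured with a rough Python sketch of a metadata repository. Everything in it is an assumption made for illustration: the class and field names, the `sales_fact` table, and the choice to model usage metadata as lookup counts and control metadata as a log of load times. It is not the design of any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ColumnDefinition:
    """Build-time metadata: links a business term to its technical structure."""
    business_name: str
    table: str
    column: str
    data_type: str


@dataclass
class MetadataRepository:
    definitions: Dict[str, ColumnDefinition] = field(default_factory=dict)
    lookups: Dict[str, int] = field(default_factory=dict)                # usage statistics
    load_log: List[Tuple[str, datetime]] = field(default_factory=list)  # control metadata

    def define(self, definition: ColumnDefinition) -> None:
        self.definitions[definition.business_name] = definition

    def resolve(self, business_name: str) -> ColumnDefinition:
        # Each lookup by a user is counted, giving simple usage statistics.
        self.lookups[business_name] = self.lookups.get(business_name, 0) + 1
        return self.definitions[business_name]

    def record_load(self, table: str, finished_at: datetime) -> None:
        # Written by the (hypothetical) tool that populates the warehouse.
        self.load_log.append((table, finished_at))

    def last_loaded(self, table: str) -> Optional[datetime]:
        """Timeliness of warehouse data: when was this table last populated?"""
        times = [t for name, t in self.load_log if name == table]
        return max(times) if times else None


repo = MetadataRepository()
repo.define(ColumnDefinition("dollars sold", "sales_fact", "dollars_sold", "DECIMAL"))
repo.record_load("sales_fact", datetime(2024, 1, 1, 2, 0))
repo.record_load("sales_fact", datetime(2024, 1, 2, 2, 0))
target = repo.resolve("dollars sold")
```

Here `last_loaded` plays the role of the control-metadata subset that tells users how current the warehouse data is.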


Q.74. Explain multidimensional data models briefly.
(R.G.P.V., June 2011)

Ans. A multidimensional data model is a popular model that influences data
warehouse architecture. This model views data in the form of a data cube or,
more precisely, a hypercube. Data can be modeled and viewed in multiple
dimensions by using a data cube. A data cube is defined by dimensions and facts.

Dimensions are the entities with respect to which an organization wants
to keep records. For example, a sales data warehouse is created to keep records
of the store’s sales along with the dimensions item, time, branch and location.
These dimensions enable the store to keep records of data such as monthly
sales of items and the branches and locations where these items were sold. Each
dimension may have a table associated with it, known as a dimension table.
For example, a dimension table for item may have the attributes item_name,
type and brand. Dimension tables are specified by users or automatically
created based on data distribution.

Facts are numerical measures that are considered as the quantities by which
relationships between dimensions are analyzed. For example, the facts of a
sales data warehouse include dollars_sold, units_sold and so on. A
multidimensional data model is organized around a central table known as the
fact table. This table contains the names of the facts, or measures, as well
as keys to each of the related dimension tables.

Cubes are usually thought of as 3D geometric structures; however, in data
warehousing the data cube is n-dimensional. To better understand data cubes
and the multidimensional model, consider a simple 2D data cube, which is, in
fact, a table or spreadsheet for sales data. Table 1.3 shows the sales data
for items sold per quarter in the city Gwalior. The fact or measure displayed
is dollars_sold in thousands. In this 2-D representation, the sales for
Gwalior are displayed along the dimensions time and item.

Table 1.3 A 2-D View of Sales Data
Location = “Gwalior”

Time (quarter) | TV | Computer | Phone | Game
Q1 | 120 | 980 | 92 | 318
Q2 | 420 | 784 | 43 | 416
Q3 | 314 | 576 | 71 | 572
Q4 | 218 | 800 | 110 | 388


Now, consider a case where we would like to view the sales data with a
third dimension. Suppose we would like to view the data with the dimensions
time, item and location for the cities New York, Chicago, Agra and Gwalior.
This 3D data is represented as a series of 2D tables as shown in table 1.4.
Fig. 1.25 shows the same data in the form of a 3D data cube.

Table 1.4 A 3-D View of Sales Data

Time | Location = “New York” | Location = “Chicago” | Location = “Agra” | Location = “Gwalior”
(quarter) | TV C P G | TV C P G | TV C P G | TV C P G
Q1 | 616 872 150 400 | 1100 500 550 1300 | 750 1000 1200 900 | 120 980 92 318
Q2 | ? 724 575 650 | 550 800 775 700 | 450 875 640 300 | 420 784 43 416
Q3 | ? 788 475 ? | 500 425 800 600 | 882 325 550 450 | 314 576 71 572
Q4 | 990 582 ? ? | 1050 282 950 1100 | 250 675 400 ? | 218 800 110 388

C = Computer, P = Phone, G = Game


[Figure axes: Time (quarters); Items (types): TV, Computer, Phone, Game]
Fig. 1.25 A 3-D Data Cube of Sales Data


Now, consider that we would like to view our sales data with a fourth
dimension, supplier. The 4D cube can be represented as a series of 3D cubes.
This 4D data is depicted in fig. 1.26. In the same way, we can represent any
n-D data as a series of (n - 1)-D “cubes”. The actual physical storage of such
data may differ from its logical representation. The data cube is a metaphor
for multidimensional data storage.
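
The ideas of dimensions, facts and lower-dimensional views can be sketched in Python using the Gwalior figures of table 1.3. The dictionary layout, the helper names `slice_cube` and `roll_up`, and the column order TV, Computer, Phone, Game are assumptions made for this illustration; as noted above, physical storage in a real warehouse may look quite different.

```python
from collections import defaultdict

DIMS = ("time", "item", "location")
ITEMS = ("TV", "Computer", "Phone", "Game")

# dollars_sold (in thousands) for Gwalior, one tuple per quarter (table 1.3).
GWALIOR = {
    "Q1": (120, 980, 92, 318),
    "Q2": (420, 784, 43, 416),
    "Q3": (314, 576, 71, 572),
    "Q4": (218, 800, 110, 388),
}

# Logical cube: each cell is addressed by one member from every dimension.
cube = {
    (quarter, item, "Gwalior"): value
    for quarter, row in GWALIOR.items()
    for item, value in zip(ITEMS, row)
}


def slice_cube(cells, **fixed):
    """Keep only the cells whose named dimensions equal the given members."""
    positions = {DIMS.index(name): member for name, member in fixed.items()}
    return {
        key: value
        for key, value in cells.items()
        if all(key[pos] == member for pos, member in positions.items())
    }


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[pos] for pos in positions)] += value
    return dict(totals)


tv_total = roll_up(cube, ["item"])[("TV",)]
q1_total = roll_up(cube, ["time"])[("Q1",)]
```

Adding further cities only adds more keys with a different `location` member, which is the sense in which a 3-D cube is a series of 2-D tables.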


[Figure: a series of 3-D cubes; axis label: Item (types): TV, Computer, Phone, Game]
Fig. 1.26 A 4-D Data Cube of Sales Data


Q.75. Discuss data cube technology briefly. (R.G.P.V., June 2008)
Or
Describe the term data cube. (R.G.P.V., Dec. 2010, June 2014)

Ans. Refer to Q.74.


Q.76. Discuss the features of the multidimensional model.
(R.G.P.V., June 2013)

Ans. Some features of the multidimensional model are as follows -

(i) The multidimensional model promotes consistency. Dimensions are
maintained as separate workspace objects and are shared by measures.

(ii) The multidimensional model enforces referential integrity on the
data. Each dimension member is unique. If a measure has three dimensions,
then that measure must be qualified by a member of each dimension.

(iii) Dimension members are kept in a default order. The default status
list is always the same unless it is altered. Because the order of dimension
members is consistent and known, the selection of members can be relative.

(iv) The multidimensional model presents data as fully solved. Users
do not need to define calculations. Because of the power and ease of use of
the OLAP DML, the analytic workspace can be prepared so that the data is
presented as fully solved to the application.


Q.77. Explain in brief about multidimensional view.

Ans. The multidimensional view of data is in some ways a natural view of
any enterprise for its managers. The triangle diagram in fig. 1.27 shows that
as we go higher in the triangle hierarchy, the managers' need for detailed
information declines. The multidimensional view of data is illustrated here by
an example of a simple OLTP database that consists of three tables. Much of
the literature on OLAP uses examples of a shoe store selling different colour
shoes of different styles. Here, a university example is used instead.
It should be noted that the relation enrolment would normally not be required,
since the degree a student is enrolled in could be included in the relation
student. But some students are enrolled in double degrees, so the relation
between the student and the degree is multifold, and hence the need for the
relation enrolment.

student (Student_id, Student_name, Country, DOB, Address)
enrolment (Student_id, Degree_id, SSemester)
degree (Degree_id, Degree_name, Degree_length, Fee, Department)

[Figure: triangle with three levels, top to bottom: Senior Executive (V-C, Deans); Department & Faculty Management (Heads); Daily Operations (Registrar, HR, Finance)]
Fig. 1.27 A Typical University Management Hierarchy

An example of the first relation, i.e., student, is given in table 1.5.

Table 1.5 The Relation Student

Student_id | Student_name | DOB | Address
8656789 | Peta Williams | 1/1/1980 | Davis Hall
8700020 | John Smith | 2/2/1981 | 9 Davis Hall
8900020 | Arun Krishna | 3/3/1983 |
8801234 | Peter Chew | 4/4/1983 |
8654321 | Reena Rani | 5/5/1984 |
8712374 | Kathy Garcia | 6/6/1980 |
8612345 | Chris Watanable | 7/7/1981 | 11 Main street
8744223 | Lars Anderssen | 8/8/1982 | Null
8977665 | Sachin Singh | 9/9/1983 | Null
9234567 | Rahul Kumar | 10/10/1984 | Null
9176543 | Saurav Gupta | 11/11/1985 | 1, Captain Drive


Table 1.6 presents an example of the relation enrolment. In this table, the
attribute SSemester is the semester in which the student started the current
degree. We code it by using the year followed by 01 for the first semester and
02 for the second. We assume that new students are admitted in each semester.
Table 1.7 is an example of the relation degree. In this table, the degree length
is given in terms of the number of semesters it normally takes to finish it. The
fee is assumed to be in thousands of dollars per year.


Table 1.6 The Relation Enrolment

Student_id | Degree_id | SSemester
8900020 | 4444 | 2000-01
[Remaining rows not recoverable; surviving SSemester values: 1999-02, 1999-02, 2000-02.]

Table 1.7 The Relation Degree

Degree_id | Degree_name | Degree_length
[Table body not recoverable.]


It is clear that the information given in tables 1.5, 1.6 and 1.7, although
suitable for a student enrolment OLTP system, is not suitable for efficient
management decision making. The managers do not need information about
the individual students, the degree they are enrolled in and the semester they
joined the university. What the managers need is the trends in student numbers
in different degree programs and from different countries. We first consider
only two dimensions. Let us say we are primarily interested in finding out how
many students from each country came to do a particular degree. Therefore
we may visualize the data as two-dimensional, i.e., Country x Degree. A table
that summarizes this type of information may be represented by the
two-dimensional spreadsheet given in table 1.8 (the numbers in table 1.8 do
not need to relate to the numbers in table 1.7). We may call this table, which
gives the number of students admitted (in, say, 2000-01), a two-dimensional
“cube”.


Table 1.8 A Two-dimensional Table of Aggregates for
Semester 2000-01
[Table body not recoverable; rows are countries, and the surviving degree column headings are LLB, MBBS and BCom.]

Using this two-dimensional view we are able to find the number of students
joining any degree from any country (only for semester 2000-01). Other queries
that we are quickly able to answer are -
(a) How many students started BIT in 2000-01 ?
(b) How many students joined from Singapore in 2000-01 ?

The data given in table 1.8 is for a particular semester, 2000-01. A similar
table would be available for other semesters. Let us assume that the data for
2001-02 is given in table 1.9.

Table 1.9 A Two-dimensional Table of Aggregates for Semester 2001-02
[Table body not recoverable.]

Let us now imagine that table 1.9 is put on top of table 1.8. We now have
a three-dimensional cube with SSemester as the vertical dimension. We now
put on top of these two tables another table that gives the vertical sums, as
shown in table 1.10.

Table 1.10 Two-dimensional Table of Aggregates for Both Semesters
[Table body not recoverable.]

Tables 1.8, 1.9 and 1.10 together now form a three-dimensional cube.
Table 1.10 provides totals for the two semesters, and we are able to drill
down to find the numbers in individual semesters. Note that a cube does not
need to have an equal number of members in each dimension. Putting the three
tables together gives a cube of 8 x 6 x 3 (= 144) cells including the totals
along every dimension.

A cube could be represented by -
Country x Degree x Semester

[Figure: cube with Country, Degree and Semester axes; semester layer 2000-01 labelled]
Fig. 1.28 The Cube Formed by Tables 1.8, 1.9 and 1.10
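
The stacking of the semester tables into a cube can be sketched in Python. Since the bodies of tables 1.8 to 1.10 are not available, the counts below are made-up placeholder numbers, and the choice of 7 countries and 5 degrees (which become 8 and 6 once a total member is added to each dimension) is an assumption that simply reproduces the 8 x 6 x 3 shape stated above.

```python
def add_totals(table):
    """Return a copy of a 2-D table with a total column and a total row appended."""
    rows = [list(row) + [sum(row)] for row in table]
    rows.append([sum(column) for column in zip(*rows)])
    return rows


def build_cube(semester_tables):
    """Stack per-semester tables and put a layer of vertical sums on top."""
    layers = [add_totals(table) for table in semester_tables]
    total_layer = [
        [sum(cells) for cells in zip(*rows)]  # same cell across all semesters
        for rows in zip(*layers)              # same row across all semesters
    ]
    return layers + [total_layer]


# Placeholder counts (hypothetical): 7 countries x 5 degrees per semester.
semester_2000_01 = [[(c + d) % 5 for d in range(5)] for c in range(7)]
semester_2001_02 = [[(c * d) % 4 for d in range(5)] for c in range(7)]

cube = build_cube([semester_2000_01, semester_2001_02])
cell_count = sum(len(row) for layer in cube for row in layer)

# Drill-down: the grand total in the top layer splits into the two semesters.
grand_total = cube[2][7][5]
per_semester = (cube[0][7][5], cube[1][7][5])
```

The last layer of `cube` corresponds to table 1.10, and reading the same cell in the lower layers is the drill-down into individual semesters.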


Q.78. What is the concept of hierarchy ? How is it related to Web mining ?
(R.G.P.V., Dec. 2004)

Ans. A sequence of mappings from a set of low-level concepts to higher-
level concepts is known as a concept hierarchy. In other words, we can say
that concept hierarchies are used to reduce the data by collecting and replacing
low-level concepts with higher-level concepts. For example, a concept hierarchy
for the dimension location is shown in fig. 1.29. For the dimension location,
the city values are Chicago, New York, Gwalior and Agra. Each city can be
mapped to the state to which it belongs. For example, Gwalior can be mapped
to MP. In turn, provinces or states can be mapped to the country to which
they belong. For example, MP can be mapped to India. These mappings form a
concept hierarchy for the dimension location, mapping a set of low-level
concepts to higher-level, more general concepts.

Fig. 1.29 A Concept Hierarchy for the Dimension Location

For a given attribute or dimension, concept hierarchies may be defined
by discretizing or grouping values, resulting in a set-grouping hierarchy. For
a given dimension or attribute, there can be more than one concept hierarchy
on the basis of various user viewpoints. In a data mining system, concept
hierarchies may be predefined if they are common to many applications. In data
mining systems, users are also provided with the flexibility to tailor
predefined hierarchies according to their particular needs. Concept
hierarchies allow data to be handled at different levels of abstraction.

A concept hierarchy may be automatically generated based on statistical
analysis of the data distribution, or it may be provided manually by users,
domain experts or knowledge engineers. For a user or a domain expert, manual
definition of a concept hierarchy can be a tedious and time-consuming task.
Fortunately, several discretization methods can be used to automatically
generate or dynamically refine concept hierarchies for numerical attributes.
Furthermore, many hierarchies for categorical attributes are implicit within
the database schema and can be automatically defined at the schema
definition level.

Relation of Concept of Hierarchy to Web Mining - Tagging a document
with a concept implicitly entails its tagging with all the ancestors of the
concept in the hierarchy. Therefore, it is desired that a document should be
tagged with the lowest concepts possible. The method to automatically tag the
document to the hierarchy is a top-down approach. An evaluation function
determines whether a document currently tagged to a node can also be tagged
to any of its child nodes. If so, then the document moves down the hierarchy
till it cannot be moved any further. The outcome of this approach is a
hierarchy of documents in which each node holds a set of documents having a
common concept related to the node. The hierarchy of documents resulting from
the tagging process is useful for the text mining process. It is assumed that
the hierarchy of concepts is known a priori. We can even have such a hierarchy
of documents without a concept hierarchy by using any hierarchical clustering
algorithm which results in such a hierarchy.

Q.79. Briefly explain the concept of pattern warehouse.

Ans. Pattern warehouse is a kind of repository which stores the relevant
patterns in a persistent manner. Patterns are the representatives of the
knowledge within data warehouses. Fig. 1.30 explains the process flow of
populating patterns within the pattern warehouse.

[Figure labels: Flat Files; Databases; Preprocessing; Data Warehouse; Data Mining; Pattern Warehouse]
Fig. 1.30 Process Flow for Pattern Warehouse

Here, fig. 1.30 clearly explains that the data from flat files and discrete
databases are first pre-processed and then stored in data warehouses, over
which data mining operations are performed which mine relevant patterns and
store them in the pattern warehouse in a non-volatile manner.

The pattern warehouse addresses the following issues related to data
warehouses -
(i) The size of a single data warehouse is quite large, spanning up to TBs,
so it becomes a tedious task to handle the management of a data warehouse for
small organizations.
(ii) As the size of the data warehouse is huge, it contains a lot of data,
but for analysis purposes the business analyst demands consolidated
information.
(iii) As patterns are in volatile form in a data warehouse, even for a
small analysis the whole process of data mining has to be performed for
obtaining certain results.

Q.80. Give comparison among database, data warehouse and pattern
warehouse.

Ans. The comparison among database, data warehouse and pattern
warehouse is given in table 1.11.
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Table 1.11 


Feature | Database | Data Warehouse | Pattern Warehouse
Characteristics | Operational processing | Informational processing | Analytical processing
Data | Current | Historical | Relational in form of patterns
Summarization | Highly detailed | Consolidated | Consolidated
View | Flat relational | Multidimensional | Relational
Number of records | | Millions | Hundreds
 | | Hundreds | Few (only administrator)
 | 100 MB to GB | 100 GB to TB | Few GB
 | | | PMQL


BASIC CONCEPTS, OLAP QUERIES, TYPES OF OLAP SERVERS, OLAP ...


Q.1. What is OLAP and what are its main characteristics ?
(R.G.P.V., Dec. 2011)

Ans. OLAP (on-line analytical processing) is an analysis technique with
functionalities like summarization, consolidation, and aggregation as well as
the ability to view information from various angles. OLAP is mainly about
being able to access live data online and analyze it. Such a system manages
huge amounts of historical data, and stores and manages information at
various levels of granularity. OLAP data are stored on multiple storage media
because of their huge volume. These features make the data easier to use in
informed decision making. OLAP also deals with information that originates
from several organizations, integrating information from many data stores. An
OLAP system uses either a star or snowflake model and a subject-oriented
database design. OLAP is performed on a data warehouse or data marts using the
multidimensional data model.


Characteristics of OLAP - The OLAP systems have FASMI characteristics,
the name being derived from the first letters of the characteristics, which are -

(i) Fast - Most OLAP queries should be answered
very quickly, perhaps within seconds. The performance of an OLAP system
has to be like that of a search engine. If the response takes more than, say,
20 seconds, the user is likely to move on to something else, assuming
there is a problem with the query. Achieving such performance is difficult.

(ii) Analytic - Rich analytic functionality must be provided by an
OLAP system, and it is expected that most OLAP queries can be solved without
programming. The system should be able to cope with any relevant queries for
the application and the user.


~ (iii) Shared — An OLAP system is a shared resource. However, it is 
unlikely to be shared by hundreds of users. An OLAP system may be used by 


i 
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LAP data in a multidimens 
i Storing O nsional data w 
1 is likely to be accessed only by a select or (i) 
s of şs and i itken x 
mere dorens Of users 3 


STOUD af atural form for most OLAP use. However e This 
R à ah Q tn : , : ise 
AP system should provide aq : Mos" eds more work compared to other anil into 
s. Being a shared system, an Soar È CqQuaty | jg the , needs more : 3 approaches, and it does not 
managers. Bemg a s lity M well as mtegnty. his oP et sufficient performance with huge data warehouse dalha 
ity fi fidentiality as as eng : | | do 
secunty for oontiden NE I- The OLAP sottware must give a mult awoys ° 
i idimensional — i 
üv) Mutndim 


ij) Storing OLAP in a relational d 


atabase, which ca 
| views of its contents. The user 


` z n generate 
| view of the data. It is due to the multidimension, 
epiual Vie 
dimensional concepta 


Cannot tell thi 
S approach fr 
EEREN A dimension often Contains pulidimen one. Here, the work of creating the dimensional view js PR 
view of data that eee hild relationships between the members o¢ al ie pr request is received. This approach is referred to as relational OLAP 

` `S en : : | : 7 
hierarchies that Se vec deonkdibealloved by the multidimensional Structure, f when th (itt) Storing OLAP data in a conventional database a 

sion, Such hierarchies s 

dimension. Suc 


gitin j 
yt ke info matio A 

AP s > Ss US 

(G N 


pm to a program parea provide users with a dimensional view This | 
Suen should be able to handle a large amount of inpy Mat z similar to the pivot table function of Microsoft Excel. It is the largely 
ai mOLAPs ter ation | OPE 3 ultidimensional data analysis product, despite being awkwardly grafted 
data. The capacity ofan O av re critical Cnt dimensional spreadsheet foundation, simply because so E, 
; r se may . a 
with the data warehous mo | 
cae a ee ves i LAP as follows — | p OLAP software is designed to work in client/server mode. The data 
ee e | use database S ouone computer, and the user sits at another that provides 
Und a A N  E e tha seed interface. Whether the majority of the analysis work is performed on 
(i) Unde p ding a number of channels for selling the product, | the on tor the server varies with the product and, for some node aan 
cts and uses pi 
has many produ in finding the most popular products and the most Popula. th E are set up: 
OLAP can assist in finding be possible to find the most profitable customer, | way RE ieee 
hann cases it may ‘ome Dy 
channels. In some 4 ran the felcsemarganeations industry and only considering | Sik eae Sete eee Creates 
for example, consi SME tions minutes, there is a large amount of data if a! D MDDB eae limer 
one product, communicatio - 


hour of the day (24 
les of product for every y (24 
Np a Ee E ia and weekends (2 values) and divid. | 
urs), 

regions to which calls are made into 50 regions. el 
(ii) Understanding and Reducing Costs of Doing Business - Improving sales is one aspect of improving a business; the other aspect is to analyze costs and control them as much as possible without affecting sales. OLAP can assist in analyzing the costs associated with sales. In some cases, it may also be possible to identify expenditures that produce a high return on investment (ROI). For example, recruiting a top salesperson may involve significant costs, but the revenue generated by the salesperson may justify the investment.

Q.3. Explain OLAP software architecture with the help of schematic diagram. (R.GP.V., June)

Ans. As the demand for OLAP capabilities grew, vendors rushed to capture a share of it. Each approached it from the perspective of its existing products, much as early automobiles resembled stagecoaches if a stagecoach maker built them, or heavy bicycles with more wheels if they were built in a bicycle factory. Fig. 2.1 shows three approaches supported by contemporary software packages. These approaches are as follows -

[Figure labels: Analysis, User, Interface, Program]
Fig. 2.1 OLAP Software Architecture

Q.4. Explain the processing of OLAP queries.

Ans. The processing of OLAP query should proceed as follows -
(i) Determine which operations should be performed on the available cuboids - This involves transforming any selection, projection, drill-up (group-by) and drill-down operations specified in the query into corresponding SQL and/or OLAP operations. For example, slicing and dicing of a data cube may correspond to selection and/or projection operations on a materialized cuboid.
(ii) Determine to which materialized cuboid(s) the relevant operations are to be applied - All the materialized cuboids that may be used to answer the query are identified, this set is pruned using knowledge of "dominance" relationships among the cuboids, the costs of using the remaining materialized cuboids are estimated, and the cuboid with the minimum cost is chosen.

In a MOLAP server, front-end multidimensional queries are mapped directly to server storage structures, because the storage model of a MOLAP server is an n-dimensional array that offers direct addressing capabilities. The straightforward array representation of the data cube has poor storage utilization when the data are sparse, but has good indexing properties. Therefore, sparse matrix and data compression techniques are adopted for efficient storage and processing. The storage structures used by sparse and dense arrays may differ, making it advantageous to use a two-level approach to MOLAP query processing: sparse matrix structures for sparse arrays and array structures for dense arrays. The two-dimensional dense arrays can be indexed by B-trees. In MOLAP query processing, the dense one and two-dimensional arrays must first be recognized. Then, indices are made to these arrays using traditional indexing structures. The utilization of storage is increased by the two-level approach without sacrificing direct addressing capabilities.
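The two-level idea can be sketched roughly as follows. This is only an illustration, not how any particular MOLAP product stores data: the chunk size, the density threshold of 0.5 and all function names (`offset`, `split`, `build_store`, `lookup`) are assumptions made for the example. Dense chunks are kept as flat arrays addressed by arithmetic, sparse chunks as small dictionaries.

```python
def offset(local, chunk):
    """Row-major offset of a local coordinate inside one chunk."""
    pos = 0
    for x, size in zip(local, chunk):
        pos = pos * size + x
    return pos


def split(coord, chunk):
    """Return (chunk id, coordinate inside that chunk) for a cell."""
    chunk_id = tuple(x // size for x, size in zip(coord, chunk))
    local = tuple(x % size for x, size in zip(coord, chunk))
    return chunk_id, local


def build_store(cells, chunk, density_threshold=0.5):
    """First level: a directory of chunks. Second level: array or sparse dict."""
    volume = 1
    for size in chunk:
        volume *= size
    grouped = {}
    for coord, value in cells.items():
        chunk_id, local = split(coord, chunk)
        grouped.setdefault(chunk_id, {})[local] = value
    store = {}
    for chunk_id, local_cells in grouped.items():
        if len(local_cells) / volume >= density_threshold:
            # Dense chunk: every slot is allocated, lookup is pure arithmetic.
            dense = [None] * volume
            for local, value in local_cells.items():
                dense[offset(local, chunk)] = value
            store[chunk_id] = ("dense", dense)
        else:
            # Sparse chunk: only the non-empty cells are kept.
            store[chunk_id] = ("sparse", local_cells)
    return store


def lookup(store, coord, chunk):
    """Fetch one cell; empty cells and missing chunks yield None."""
    chunk_id, local = split(coord, chunk)
    entry = store.get(chunk_id)
    if entry is None:
        return None
    kind, data = entry
    if kind == "dense":
        return data[offset(local, chunk)]
    return data.get(local)


# Hypothetical 4 x 4 cube cut into 2 x 2 chunks.
CHUNK = (2, 2)
CELLS = {(0, 0): 5, (0, 1): 7, (1, 0): 3, (1, 1): 9, (3, 3): 4}
STORE = build_store(CELLS, CHUNK)
```

Here the fully populated chunk is stored as an array, while the chunk holding a single value is stored sparsely; both are reached through the same `lookup` call.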
To better understand OLAP query processing, consider that we define a data cube for sales like "sales_cube [item, time, location] : sum (sales_in_dollars)". The dimension hierarchies used are "item_name < brand < type" for item, "day < month < quarter < year" for time, and "street < city < province_or_state < country" for location.
Consider that the query to be processed is on {brand, province_or_state} with the selection constant "year = 2009". Also, consider that there are four materialized cuboids as follows -
cuboid 1 : {year, brand, country}
cuboid 2 : {year, item_name, city}
cuboid 3 : {year, brand, province_or_state}
cuboid 4 : {item_name, province_or_state} where year = 2009
Cuboid 1 cannot be used since country is a more general concept in comparison to province_or_state. Cuboids 2, 3 and 4 can be used for query processing because of the following reasons -
(i) The selection clause in the query can imply the selection in the cuboid.
(ii) The abstraction levels for the item and location dimensions in these cuboids are at a finer level as compared to brand and province_or_state, respectively.
(iii) They have the same set or a superset of the dimensions in the query.

For this query, cuboid 2 would cost the most since both item_name and city are at a lower level as compared to brand and province_or_state. If there are not many year values associated with items in the cube, but there are several item names for each brand, then cuboid 3 will be smaller than cuboid 4, and hence cuboid 3 should be selected for query processing. However, cuboid 4 may be a good choice if efficient indices are available for it. Thus, to decide which set of cuboids should be chosen for query processing, some cost-based estimation is needed.
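The cuboid test used in this example can be sketched in a few lines. The sketch below is an illustration under stated assumptions: the dictionary layout, the `prefilter` field and the row counts in `ASSUMED_ROWS` are hypothetical (the row counts are invented purely to show a cost-based choice), while the hierarchies, query and four cuboids are the ones given above.

```python
# Concept hierarchies from the example, ordered from finest to coarsest level.
HIERARCHIES = {
    "item": ["item_name", "brand", "type"],
    "time": ["day", "month", "quarter", "year"],
    "location": ["street", "city", "province_or_state", "country"],
}


def rank(dimension, level):
    """Position of a level in its hierarchy; a smaller rank means finer data."""
    return HIERARCHIES[dimension].index(level)


def can_answer(cuboid, query):
    """A cuboid is usable if it is at least as fine as the query on every dimension."""
    for dim, level in query["group_by"].items():
        have = cuboid["levels"].get(dim)
        if have is None or rank(dim, have) > rank(dim, level):
            return False
    for dim, (level, value) in query["select"].items():
        # A cuboid materialized with the same selection already satisfies it.
        if cuboid.get("prefilter", {}).get(dim) == (level, value):
            continue
        have = cuboid["levels"].get(dim)
        if have is None or rank(dim, have) > rank(dim, level):
            return False
    return True


QUERY = {
    "group_by": {"item": "brand", "location": "province_or_state"},
    "select": {"time": ("year", 2009)},
}

CUBOIDS = {
    1: {"levels": {"time": "year", "item": "brand", "location": "country"}},
    2: {"levels": {"time": "year", "item": "item_name", "location": "city"}},
    3: {"levels": {"time": "year", "item": "brand", "location": "province_or_state"}},
    4: {"levels": {"item": "item_name", "location": "province_or_state"},
        "prefilter": {"time": ("year", 2009)}},
}

# Hypothetical size estimates, standing in for a real cost model.
ASSUMED_ROWS = {1: 50, 2: 50000, 3: 400, 4: 1200}

usable = [cid for cid, cuboid in CUBOIDS.items() if can_answer(cuboid, QUERY)]
cheapest = min(usable, key=ASSUMED_ROWS.get)
```

With these assumed sizes the choice falls on cuboid 3, matching the reasoning in the text; a different cost model (for example, one that rewards available indices on cuboid 4) could choose otherwise.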




Q.5. What are the different types of OLAP servers ? Explain them. (R.GP.V., Dec. 2003)
Or
Write short note on OLAP servers. (R.GP.V., Dec. 2004)
Ans. Logically, OLAP servers present business users with multidimensional data from data warehouses or data marts, without concerns about where and how the data are recorded. The following servers are used for the implementation of a data warehouse server for OLAP processing -

(i) Multidimensional OLAP (MOLAP) Servers - These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The benefit of using a data cube is that it permits fast indexing to precomputed summarized data. Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets. Denser subcubes are identified and recorded as array structures, while sparse subcubes use compression technology for efficient storage utilization.

(ii) Relational OLAP (ROLAP) Servers - ROLAP servers are the intermediate servers that stand between a relational back-end server and client front-end tools. ROLAP servers incorporate implementation of aggregation navigation logic, optimization for each DBMS back end, and additional tools and services. To record and maintain data warehouse data, these servers use a relational or extended relational DBMS, and OLAP middleware to support missing pieces. The ROLAP approach is adopted by the DSS server of MicroStrategy. ROLAP technology tends to have greater scalability as compared to MOLAP technology.

(iii) Hybrid OLAP (HOLAP) Servers - These servers are the combination of both MOLAP and ROLAP technology, benefiting from the faster computation of MOLAP and the greater scalability of ROLAP. For example, a HOLAP server may store a huge volume of detail data in a relational database, whereas aggregations are kept in a separate MOLAP store. A hybrid OLAP server approach is adopted by Microsoft SQL Server 2000.

(iv) Specialized SQL Servers - Some database system vendors implement specialized SQL servers to satisfy the growing demand for OLAP processing in relational databases. These servers offer advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
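The HOLAP split described in (iii) can be pictured with a small sketch. It is a toy illustration, not the design of any real server: the sample rows and the names `DETAIL_ROWS`, `AGGREGATES`, `monthly_total` and `drill_through` are all assumptions. Summary questions are answered from a precomputed store, while detail questions go back to the relational rows.

```python
from collections import defaultdict

# Detail rows as they might sit in a relational fact table:
# (item, month, dollars_sold). The values are made up for illustration.
DETAIL_ROWS = [
    ("TV", "2009-03", 250.60),
    ("TV", "2009-03", 175.00),
    ("Phone", "2009-03", 80.00),
    ("TV", "2009-04", 300.00),
]


def build_aggregate_store(rows):
    """Precompute monthly totals per item, playing the role of the MOLAP store."""
    totals = defaultdict(float)
    for item, month, amount in rows:
        totals[(item, month)] += amount
    return dict(totals)


AGGREGATES = build_aggregate_store(DETAIL_ROWS)


def monthly_total(item, month):
    """Summary query: answered from the precomputed aggregates."""
    return AGGREGATES.get((item, month), 0.0)


def drill_through(item, month):
    """Detail query: falls back to the relational detail rows."""
    return [row for row in DETAIL_ROWS if row[0] == item and row[1] == month]
```

A call such as `monthly_total("TV", "2009-03")` never touches the detail rows, whereas `drill_through("TV", "2009-03")` returns the two underlying sales.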


Q.6. Briefly discuss types of OLAP and also explain the processing of OLAP queries. (R.GP.V., June 2011)

Ans. Types of OLAP - Refer to Q.5.
Processing of OLAP Queries - Refer to Q.4.

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is virtual data warehouse 


Í ? De 

Q.7. Write shori notes on — K gl. Beanie efine ROLAP MOLAP ang 

(i) ROLAP (ii) MOLAR (RGN, June 29 qp with & (R.GEV, Jur 
6) gôl «gal Data Warehouse — A virtual 1e 2012) 

dns. Refer to Q.S. Ans: virtua Set vata warehouse can be vie 
An: à OLAP MOLAP versus HOLAP views over eat N atabases. For efficient processing of ite as a 

iefly re ROLAK versus ©" saa AP sery 0 ‘ble views of summary ma . ery, onl 

0.8. Briefly compa (R.GRV., Dec. 2009, June Ver, | sot rthe possible ry may be materialized. A virtua] warehous. 


0 i AA 
2014) om? te capacity on operational database servers but is easy to build 


Ans. Refer to Q.5. se h oe awe 
Z P, MOLAF HOLA efer to Q.5. 
P different types of OL ROLA ’ 
0.9. Define OLAP. What are the four diffe pes of OLAP se £ Be cm ea | 


a > 3 „Jew ? Explain briefly. (R.G P.V., Jun | 
implementation point of view £ Exp e 2015 Wi 
from impl ) Q els 


fer to Q.1 and Q.5. NS. 
Ans. Refer to Q ‘ai „ummary type query, HOLAP leverages cube t 


| a ecl 
0.10. Discuss various types of ae ke alee =o g nally ! ance. When detail information is needed, it can init os faster 
stored in different server architectures < . ’ ) 2014) ae the underlying relational database. Cubes stored as sre H 


Ans. Types of OLAP Shot ala Bees Eto tok A Rie than equivalent a ase z pee quicker than ROLAP clibes 
ROLAP uses relational tables. The fact an | for queries N ic E Eons F EE k generally suitable for 
associated with a base cuboid is known as a base fact table that stores data i | pubes ie es based on a large amount 
the abstraction level shown by the join keys in the schema for the given dat, of base 08t- 
cube. Aggregated data can also be stored in fact tables, known as summary fact tables. Separate summary fact tables can be used for each level of abstraction, to store only aggregated data. Some summary fact tables store both base fact data and aggregated data.
For example, a summary fact table that stores both base fact data and aggregated data is illustrated in table 2.1. The schema of the table is "<record identifier (RID), item, day, month, quarter, year, dollars_sold>", in which dollars_sold is the sales amount and day, month, quarter, and year define the date of sales. Consider the tuples with an RID of 0001 and 0002. The data of these tuples are at the base fact level, where the dates of sales are March 07, 2009 and March 18, 2009, respectively. Now consider the tuple with an RID of 0050. This tuple is at a more general level of abstraction as compared to the tuples 0001 and 0002. The day value has been generalized to all, therefore the date of sales is March 2009. The special value all shows subtotals in summarized data. For example, the dollars_sold amount shown is an aggregation representing the entire month of March 2009. However, to store data for on-line analytical processing, MOLAP uses multidimensional array structures.

Table 2.1
RID | item | day | month | quarter | year | dollars_sold
0001 | ... | 07 | March | Q1 | 2009 | 250.60
0002 | ... | 18 | March | Q1 | 2009 | 175.00
0050 | ... | all | March | Q1 | 2009 | 45,786.08

[Figure: sources; DW star schema with only base data; multidimensional aggregated cube; OLAP caching server; LDAP authentication server]
Fig. 2.2 HOLAP Architecture

In order to deliver the combined strengths of MOLAP and ROLAP technologies, HOLAP systems must comply with the following rules -
(i) Fast access at all levels of aggregation (MOLAP requirement).
(ii) Easy aggregate maintenance (MOLAP requirement).
(iii) Compact aggregate storage (MOLAP requirement) - for high-level aggregates in order to economize disk space.
(iv) Dynamically updated dimensions (ROLAP requirement) - real time access to the data itself and to rapidly changing structures.
(v) Multidimensional view based on RDBMS metadata (ROLAP requirement) - the view should point to the appropriate RDBMS tables and automatically generate the required SQL statements when the multidimensional view is modified; this reduces development time and maintenance.

Example - Express from Oracle, IBM DB2 OLAP Server, Microsoft OLAP Services, Sagent Holos.
Q.13. Briefly discuss the advantages of ROLAP and MOLAP.
(R.GB.V., Dec. 201) 
Ans. Advantages of ROLAP - One advantage of using ROLAP is that sparse data sets may be stored more compactly in tables than in arrays. Since ROLAP is an extension of the mature relational database technology, we can take advantage of using SQL. In addition, ROLAP is very scalable.
Advantages of MOLAP - MOLAP abandons the relational structure and uses a sparse matrix file representation to store the aggregations efficiently. This gains efficiency, but lacks flexibility, restricts the number of dimensions, and is limited to small databases. One advantage of using MOLAP is that dense arrays are stored more compactly in the array format than in tables. In addition, array lookups are simple arithmetic operations, which results in an instant response.
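The compactness argument can be checked with rough arithmetic. The sketch below is a deliberately crude model and rests on assumptions: it counts "slots" rather than bytes, ignores indexes and compression, and the cube shape and cell counts are invented for illustration. An array reserves one slot per possible cell, whereas a table stores one row per non-empty cell but must repeat the dimension keys in every row.

```python
def array_slots(shape):
    """Slots a plain multidimensional array reserves: one per possible cell."""
    total = 1
    for size in shape:
        total *= size
    return total


def table_fields(non_empty_cells, n_dims):
    """Fields a fact table stores: each row carries its keys plus one measure."""
    return non_empty_cells * (n_dims + 1)


# Hypothetical cube: 100 items x 12 months x 50 cities.
SHAPE = (100, 12, 50)

sparse_array = array_slots(SHAPE)          # cost of the array, however empty it is
sparse_table = table_fields(600, 3)        # only 1% of the cells hold data
dense_table = table_fields(54000, 3)       # 90% of the cells hold data
```

Under these assumptions the table is far smaller than the array when only 1% of the cells are filled, and several times larger when 90% are filled, which is the trade-off the answer describes.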


Q.14. Write advantages and disadvantages of HOLAP.
Ans. Advantages - There are several advantages of HOLAP as follows -
(i) Combined advantages of both ROLAP and MOLAP - Refer to Q.13.
(ii) Can combine the ROLAP technology for sparse regions and MOLAP for dense regions. Also, ROLAP can be used for storing the detailed data and MOLAP for higher-level summary data.

Disadvantages - There are several disadvantages of HOLAP as follows -
(i) Complexity - A HOLAP server must support both MOLAP and ROLAP engines, and tools to combine both storage engines and operations.
(ii) Functionality overlap - There is overlap between the storage and optimization techniques in ROLAP and MOLAP engines.


Q.15. Why ROLAP is used in the relational database environment ? (R.GP.V., Dec. 2011)

Ans. ROLAP, also called multi-relational OLAP, is the fastest growing style of OLAP technology. These tools support RDBMS products directly through a dictionary layer of metadata, thereby avoiding the requirement to create a static multidimensional data structure. ROLAP tools facilitate the creation of multiple multidimensional views of the two-dimensional relations. To improve performance, some ROLAP tools have developed enhanced SQL engines to support the complexity of multidimensional analysis. To provide flexibility, some ROLAP tools recommend or require the use of highly denormalized database designs such as the star schema.
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tools provide the benefit of full analytical functional 
he advantages of relational data. These tools ae ie 
ema design, and their technology js limited ue a 
rate tier architecture. As the data is physically ee i kadi 
cessing, sometimes the scope of analysis is limited I : mona 
in the dimensional structure requires a physical se Pi 
which is very time consuming. i ae RRS 


Q.16. Describe the architecture of MOLAP in detail with the help of suitable diagram. (R.GP.V., June 2013)

Ans. In the MOLAP model, data for analysis is stored in specialized multidimensional databases. Large multidimensional arrays are used as storage structures. For instance, to store a sales number of 1000 units for product Printer, in month number 2013/07, in store SmartShop, under distributing channel Channel20, the sales number of 1000 is stored in an array represented by the values (Printer, 2013/07, SmartShop, Channel20). The array values specify the location of cells. The intersections of the values of dimension attributes form the cells. If you note how the cells are formed, you will realize that not all cells have values of metrics. The cells representing Tuesdays will all be nulls if a store is closed on Tuesdays.

[Figure: presentation layer; MOLAP engine, which creates and stores summary data cubes; data layer with the RDBMS data warehouse server]
Fig. 2.3 MOLAP Model



The architecture of MOLAP model is shown in fig. 2.3. There are three layers in the multitier architecture. Precalculated and prefabricated multidimensional data cubes are stored in multidimensional databases. In the application layer, the MOLAP engine pushes a multidimensional view of data from the MDDBs to the users. Multidimensional database management systems are proprietary software systems. These systems give the capability to consolidate and fabricate summarized cubes during the process that loads data into the MDDBs from the main data warehouse. The users requiring summarized data take benefit of fast response times from the preconsolidated data.

Q.17. Explain the following -
(i) Hypercube (ii) Metadata warehouse (iii) MDDB. (R.GP.V., June 201)

Ans. (i) Hypercube - In a hypercube, the data are stored as a multidimensional array where each cell in the array represents the intersection of all of the dimensions. Using this approach, any number of dimensions may be analyzed simultaneously, and any number of multidimensional views of the data can be created. In such a design, however, many cells in the hypercube may not have a value.
(ii) Metadata Warehouse - Refer to Q.73, Unit-I.
(iii) MDDB - A multidimensional database (MDDB), also known as dimensional database, represents the multidimensional characteristics of data internally. Data are stored by it in a giant hypercube. This preserves the data so that we can find data easily. In a multidimensional database, data are stored in the form of a multidimensional array rather than as separate tables and are used easily by the computer to locate any item of interest. Calculation of totals and averages on any desired dimension is also easy, since the locations of systematically related data in the database adopt a regular pattern. An MDDB can retrieve the data elements quickly.

Q.18. List the guidelines for OLAP implementation.

Ans. Following are a number of guidelines for successful implementation of OLAP. The guidelines are somewhat similar to those presented for data warehouse implementation.
(i) Vision - The OLAP team must, in consultation with the users, develop a clear vision for the OLAP system. This vision, including the business objectives, should be clearly defined, understood and shared by the stakeholders.
(ii) Senior Management Support - The OLAP project should be fully supported by the senior managers. Since a data warehouse may have been developed already, this should not be difficult.
(iii) Selecting an OLAP Tool - The OLAP team should familiarize themselves with the ROLAP and MOLAP tools available in the market. Since tools are quite different, careful planning may be required in selecting a tool that is appropriate for the enterprise. In some situations, a combination of ROLAP and MOLAP may be most effective.
(iv) Corporate Strategy - The OLAP strategy should fit in with the enterprise strategy and business objectives. A good fit will result in the OLAP tools being used more widely.
(v) Focus on the Users - The OLAP project should be focused on the users. Users should, in consultation with the technical professionals, decide what tasks will be done first and what will be done later. Attempts should be made to provide each user with a tool suitable for that person's skill level and information needs. A good user interface should be provided to non-technical users. The project can only be successful with the full support of the users.
(vi) Joint Management - The OLAP project must be managed by both the IT and business professionals. Many other people should be involved in supplying ideas. An appropriate committee structure may be necessary to channel these ideas.
(vii) Review and Adapt - Regular reviews of the project may be required to ensure that the project is meeting the current needs of the enterprise.

Q.19. Differentiate between OLTP and OLAP. (R.GP.V., Dec. 2003)
Or
Differentiate between operational database systems and data warehouses. (R.GP.V., June 2011)
Or
How is data warehouse different from a database ? How are they similar ? (R.GP.V., Dec. 2009, June 2014)
Or
How is a data warehouse different from a database ? How are they similar to each other ? (R.GP.V., June 2015)
Or
What is the difference between OLTP and data warehouse ? (R.GP.V., June 2016)
Or
List the difference between OLAP and OLTP. (R.GP.V., June 2016)
Or
List out the difference between OLTP and OLAP. (R.GP.V., May 2019)
Or
Write the difference between OLAP and OLTP. (R.GP.V., Nov. 2019)

Ans. The main task of on-line operational database systems is to perform on-line transaction and query processing. Most of the day-to-day operations of an organization are covered by these systems, like accounting, manufacturing, inventory, banking, registration, payroll, and purchasing. These systems are known as on-line transaction processing (OLTP) systems. On the other hand, data warehouse systems serve users or knowledge workers in the

role of data analysis and decision making. In order to accommodate the diverse needs of different users, these systems can organize and present data in various formats. These systems are called on-line analytical processing (OLAP) systems.

The major differences between OLTP (operational database) and OLAP (data warehouse) are given below -
(i) Database Design - An OLTP system uses an entity-relationship data model and an application-oriented database design. On the other hand, an OLAP system uses either a star or snowflake model and a subject-oriented database design.
(ii) Data Contents - An OLTP system maintains current data that is too detailed to be easily used for decision making. On the contrary, an OLAP system handles huge amounts of historical data, offers facilities for summarization and aggregation, and manages information at distinct levels of granularity.
(iii) Users and System Orientation - An OLTP system is customer-oriented and is used for transaction and query processing by clients, clerks, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including executives, managers and analysts.
(iv) Access Patterns - The access patterns of an OLTP system mainly consist of short atomic transactions. Such a system requires concurrency control and recovery mechanisms. On the other hand, accesses to OLAP systems are mostly read-only operations.
(v) View - An OLAP system spans multiple versions of a database schema due to the evolutionary process of an organization. OLAP data are stored on multiple storage media due to their large volume. Integrating data from many data stores, OLAP systems also deal with information that originates from various organizations. In contrast, an OLTP system emphasizes mainly the current data within an enterprise or department, without referring to historical data or data in various organizations.

In addition to the above features, other features that are used to differentiate between OLTP and OLAP systems are database size, frequency of operations and performance metrics. Table 2.2 summarizes all these features.


Table 2.2

Feature | OLTP | OLAP
DB design | ER based, application-oriented | Star/snowflake, subject-oriented
User | Clerk, DBA, database professional | Knowledge worker (e.g., manager, executive, analyst)
Data | Current; guaranteed up-to-date | Historical; accuracy maintained over time
Focus | Data in | Information out
DB size | 100 MB to GB | 100 GB to TB
Access | Read/write | Mostly read
Function | Day-to-day operations | Long-term informational requirements, decision support
View | Detailed, flat relational | Summarized, multidimensional
Number of records accessed | Tens | Millions
Summarization | Primitive, highly detailed | Summarized, consolidated
Characteristic | Operational processing | Informational processing
Number of users | Thousands | Hundreds
Orientation | Transaction | Analysis
Unit of work | Short, simple transaction | Complex query
Priority | High performance, high availability | High flexibility, end-user autonomy
Operations | Index/hash on primary key | Lots of scans
Metric | Transaction throughput | Query throughput, response time

Q.20. What is data warehouse ? How is data warehouse different from
a database ? (R.GP.V., June 2009, 2014) 


Ans. Refer to Q.1 (Unit-I) and Q.19. 


Q.21. Discuss various OLAP operations that can be performed on 


multidimensional data model. (R.GP.V., Dec. 2008) 
Or 
Discuss typical OLAP operations in brief. 
Or 
Briefly compare roll-up, drill-down, slice and dice OLAP operations.
(R.GP.V., Dec. 2009, June 2014)


(R.GP.V., June 2015) 


Or 
Differentiate the following — 
(i) Roll-up and drill-down operations 
(ii) Slice and dice operations. 
(R.GP.V., June 2010) 
Or 
Explain OLAP operation in multidimensional data model.
(R.GP.V., May 2018)
Or
Explain data cube operations.
Or
Discuss the typical OLAP operations with an example.
(R.GP.V., May 2019)
Or
Describe OLAP operations in multidimensional data model.
(R.GP.V., Nov. 2019)

Ans. The basic operations of OLAP for multidimensional data are as follows -
(i) Slice - Slice operation is used to reduce the data cube by one or more dimensions. This operation is performed by a selection on one dimension of the given cube, providing a subcube. A slice operation on the data cube where the sales data are selected for the dimension time using time = Q4 is shown in fig. 2.4.

[Figure: data cube with axes Location (cities), Time (quarters) and Items (types: TV, Computer, Phone, Game), and the slice for Time = Q4]
Fig. 2.4



(ii) Dice - Dice operation is also used for reducing the data cube by one or more dimensions. This operation is for selecting a smaller data cube and analyzing it from different perspectives. This operation is performed by a selection on two or more dimensions of the given cube, providing a subcube. In other words, we can say that it is performed by a slice on one dimension and then rotating the cube to select on a second dimension. A dice operation where the sales data are selected from the data cube for the dimensions time (a choice of two quarters) and location (location = Gwalior or Chicago) is shown in fig. 2.5.

Fig. 2.6 shows a dice operation when we select on more than two dimensions (e.g., three), namely time (a choice of two quarters), location (location = Gwalior or Chicago) and items (items = Computer or Phone).

[Figure: diced cube with items TV, Computer, Phone, Game on the Items (types) axis]
Fig. 2.5

[Figure: diced cube with items Computer, Phone on the Items (types) axis]
Fig. 2.6



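(iii) Drill-up - The drill-up operation deals with switching from a detailed to an aggregated level within the same classification hierarchy. This is also known as roll-up operation. It performs aggregation on a data cube either by dimension reduction or by climbing up a concept hierarchy for a dimension. When drill-up is carried out by dimension reduction, one or more dimensions are removed from the given cube. A drill-up operation carried out on a data cube by climbing up a concept hierarchy for the location dimension is shown in fig. 2.7. This occurs by ascending the location hierarchy, and the resulting cube groups the data at the more general level.

[Figure: rolled-up cube with axes Time (quarters) and Items (types: TV, Computer, Phone, Game)]
Fig. 2.7

[Figure: drilled-down cube with axes Time (months, e.g. September, October, November, December) and Items (types: TV, Computer, Phone, Game)]
Fig. 2.8
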

(iv) Drill-down - The drill-down operation deals with switching from an aggregated to a more detailed level within the same classification hierarchy. This operation is the reverse of the drill-up operation. Drill-down permits a user to get more detailed fact information by navigating lower in the aggregation hierarchy. It is performed on a data cube either by introducing additional dimensions or by stepping down a concept hierarchy for a dimension. When drill-down is performed by introducing additional dimensions, one or more dimensions are added to the given cube. A drill-down operation carried out on a data cube by stepping down a concept hierarchy for the time dimension is shown in fig. 2.8.

This occurs by descending the time hierarchy from the level of quarter to the level of month. The resulting data cube details the total per month.

(v) Pivot - Pivot or rotate is a visualization operation. It rotates the data axes in order to offer an alternative data presentation. A pivot operation in which the item and location axes in a 2D slice are rotated is shown in fig. 2.9.

[Figure: 2D slice with axes Items (types) and Location (cities: New York, Chicago, Agra, Gwalior), shown with the axes rotated]
Fig. 2.9
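The five operations can be tried out on a very small cube. The sketch below is illustrative only and rests on assumptions: the cube is a plain dictionary keyed by (city, quarter, item), the sales figures are invented, and the city-to-country mapping `COUNTRY` is a hypothetical concept hierarchy that the text does not specify.

```python
from collections import defaultdict

DIMS = ("city", "quarter", "item")

# Hypothetical sales cube: (city, quarter, item) -> units sold.
CUBE = {
    ("Gwalior", "Q1", "TV"): 10,
    ("Gwalior", "Q1", "Phone"): 5,
    ("Gwalior", "Q2", "TV"): 7,
    ("Agra", "Q1", "TV"): 4,
    ("Agra", "Q4", "Game"): 3,
    ("Chicago", "Q1", "TV"): 20,
    ("Chicago", "Q2", "Computer"): 8,
}

# Assumed concept hierarchy for location: city < country.
COUNTRY = {"Gwalior": "India", "Agra": "India",
           "Chicago": "USA", "New York": "USA"}


def dice(cube, **allowed):
    """Select on several dimensions at once, e.g. city={...}, quarter={...}."""
    def keep(key):
        return all(key[DIMS.index(dim)] in values
                   for dim, values in allowed.items())
    return {key: value for key, value in cube.items() if keep(key)}


def slice_(cube, dim, value):
    """A slice is a selection on a single dimension with a single value."""
    return dice(cube, **{dim: {value}})


def roll_up(cube, dim, mapping):
    """Climb one level of a concept hierarchy and add up the merged cells."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + (mapping[key[i]],) + key[i + 1:]] += value
    return dict(out)


def pivot(table_2d):
    """Rotate a 2D view by swapping its two axes."""
    return {(b, a): value for (a, b), value in table_2d.items()}


q1 = slice_(CUBE, "quarter", "Q1")
diced = dice(CUBE, city={"Gwalior", "Chicago"}, quarter={"Q1", "Q2"})
by_country = roll_up(CUBE, "city", COUNTRY)
# Drill-down is the reverse step: going from by_country back to CUBE,
# which is only possible because the finer-grained cells were kept.
city_by_item = {(city, item): v for (city, _quarter, item), v in q1.items()}
item_by_city = pivot(city_by_item)
```

Rolling up merges the Gwalior and Agra cells for Q1 TV sales into one cell for India, while the pivot changes only the orientation of the Q1 slice, not its values.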

Q.22. What is OLAP ? Discuss the basic operations of OLAP. (R.GP.V., Dec. 2003)
Or
What do you understand by the term OLAP ? List the various types of OLAP operations which are supported by the OLAP tools. (R.GP.V., June 2013)

Ans. Refer to Q.1 and Q.21.


Q.23. Explain multidimensional cubes and describe how the slice and dice technique fits into this model. (R.GP.V., Dec. 2011)
Ans. Multidimensional Cubes - Refer to Q.74, Unit-I.
Slice and Dice Techniques - Refer to Q.21.
Q.24. Illustrate how each of the following functions may be implemented in MOLAP.
(i) The generation of a data warehouse
(ii) Roll-up (iii) Drill-down.


(R.GP.V., June 201) 
Ans. Refer to Q.49 (Unit-I) and Q.21. 


DATA WAREHOUSE HARDWARE AND OPERATIONAL DESIGN - SECURITY, BACKUP AND RECOVERY

Q.25. Describe the term hardware architecture.

Ans. The hardware architecture of a data warehouse is defined within the technical blueprint stage of the process. The business requirements stage should have identified the initial user requirements and given an indication of the capacity planning requirements. The hardware architecture is determined once a broad understanding of the required technical architecture has been achieved. The backup and security strategies are also determined during the technical blueprint phase. The classification of hardware architecture is shown in fig. 2.10.


[Figure: Hardware Architecture, classified into Server Hardware (architecture options, symmetric multi-processing (SMP), MPP technology, cluster technology, new and emerging technologies, server management), Network Hardware (network architecture, network management) and Client Hardware (client management, client tools)]
Fig. 2.10 The Classification of Hardware Architecture

(i) Server Hardware - The server is a crucial part of the data warehouse environment. To support the size of the database and the ad hoc nature of the query access, warehouse applications generally require large hardware configurations. The different architectures are as follows -

(a) Architecture Options - There are two main hardware architectures commonly used as server platforms in data warehouse solutions, namely symmetric multi-processing (SMP) and massively parallel processing (MPP). Technologies such as high-speed memory interconnects and non-uniform memory architecture (NUMA) are adding to the SMP architecture and allow it to scale to much higher levels.



(b) Symmetric Multi-processing (SMP) - An SMP machine is a set of tightly coupled CPUs that share memory and disk. This is sometimes called a shared everything environment.

(c) Massively Parallel Processing (MPP) - An MPP machine is a set of loosely coupled CPUs, each of which has its own memory and disk. An MPP machine is made up of many loosely coupled nodes, linked together by a high-speed connection. The form that this connection takes varies from vendor to vendor. Each node has its own memory, and the disks are not shared, each being attached to only one node. However, most MPP systems allow a disk to be dual connected between two nodes. This protects against an individual node failure causing disks to be unavailable.


(d) Cluster Technology - A cluster is a set of loosely coupled SMP machines connected by a high-speed interconnect. Each machine has its own CPUs and memory, but they share access to disk. Thus these systems are called shared disk systems. Each machine in the cluster is called a node. The aim of the cluster is to mimic a single larger machine. In this pseudo single machine, resources like shared disk must be managed in a distributed fashion. A layer of software known as the distributed lock manager is used to achieve this. It maintains the integrity of each resource by ensuring that the resource can be changed on only one node of the cluster at a time.

(e) New and Emerging Technology - A non-uniform memory architecture (NUMA) machine is basically a tightly coupled cluster of SMP nodes, with an extremely high-speed interconnect. The interconnect needs to be sufficiently fast to give near-SMP internal speeds.

(f) Server Management - Management tools are required to manage a large, dynamic and complex system such as a data warehouse. These tools allow the automatic monitoring of most if not all the required processes and statistics such as hardware failure, a process dying, low buffer cache hit ratios, a process returning an error, etc.

(ii) Network Hardware - The network hardware can play an important part in a data warehouse's success.

(a) Network Architecture - The network architecture and bandwidth must be capable of supporting the data transfer and any data extraction in an acceptable time. The transfer of data to be loaded must complete quickly enough to allow the rest of the overnight processing to complete.

The main aspects of a data warehouse design that may be affected by the network architecture are user access, source system data transfer and data extractions. Each of these issues needs to be considered carefully.

(b) Network Management - Network management is a black art and also requires specialist tools and lots of network experience. The management of the network has no direct effect on a data warehouse, except for one issue. It is important to be able to monitor network performance. The network may play a key part in data flow through a data warehouse environment.

We should consider the following possibilities during the design phase -
(i) Whether the new data sources will require new security and/or audit restrictions to be implemented ?
(ii) Whether the new users added have restricted access to data that is already generally available ?
This situation arises when the future users and the data sources are not well known. In such a situation, we need to use the knowledge of business and the objective of the data warehouse to know the likely requirements.
The following activities get affected by security measures -
(i) User access (ii) Data load
(iii) Data movement (iv) Query generation.

Q.27. What do you understand by user access ? Explain.

Ans. We need to first classify the data and then classify the users on the basis of the data they can access. In other words, the users are classified according to the data they can access.
Data Classification - The following two approaches can be used to classify the data -
(i) Data can be classified according to its sensitivity. Highly-sensitive data is classified as highly restricted and less-sensitive data is classified as less restrictive.
(ii) Data can also be classified according to the job function. This restriction allows only specific users to view particular data. Here we restrict the users to view only that part of the data in which they are interested and for which they are responsible.

There are some issues in the second approach. To understand them, let us take an example. Suppose you are building the data warehouse for a bank. Consider that the data being stored in the data warehouse is the transaction data for all the accounts. The question here is, who is allowed to see the transaction data.

(iii) Client Hardware — Clients are external to the data warehoug 
system itself with the network. 


(a) Client Tools — The tools must not be allowed to affect the 
design of the data warehouse. _ The solution lies In classifying the data according to the function 


(b) Client Management — Management of the clients is beyond User Classification — The following approaches can be used to classify 
the scope of the data warehouse environment. Make sure those responsible for| the use 

the network are fully aware of all the dataware house network requirements 
and that they can meet them. 


i Users can be classified as per the hierarchy ofusers in an organization 

an be classified by departments, sections, groups, and so on. : 

) Users can also be classified according to their role with people 
a departments based on their role. 


ie, use 
2.26. Discuss the security requirements of data warehouse. 
Ans. Adding security features affect the performance oi the data 
warchouse, therefore itis important to determine the security requirements & 
early as possible. Hiis diffieuit to add security features after the data warehouse 
has gone live. 
During the design phase of the data warehouse, we should keep in mind 
what data sources may be added later and what would be the impact of adding 


on Based on Department — Let’s have an example of a data 
the users are from sales and marketing department. We can 
y top-to-down company view, with access centered on the 


ents, But there could be some restrictions on users at different 
tructure is shown in fig. 2.11. 


K 


s 
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Summarized 
Sales Data 


Fig. 2.1] User Access Hierarchy 


But if each department accesses different data, then we should design
the security access for each department separately. This can be achieved by
departmental data marts. Since these data marts are separated from the data
warehouse, we can enforce separate security restrictions on each data mart.
This approach is shown in fig. 2.12.


[Figure: the data warehouse feeds departmental data marts (for example a
marketing data mart), each accessed by its own users.]

Fig. 2.12 Data Marts to Enforce Restrictions on Access to Data


Classification Based on Role - If the data is generally available to all
the departments, then it is useful to follow the role access hierarchy. In other
words, if the data is generally accessed by all the departments, then apply
security restrictions as per the role of the user. The role access hierarchy is
shown in fig. 2.13.


[Figure: role access hierarchy over the data warehouse, with Reference Data,
Summarized Sales Data and Detailed Sales Data.]

Fig. 2.13 Role Access Hierarchy
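
The two classification approaches above can be combined into a single access
check. The following is only a minimal sketch under stated assumptions: the
sensitivity levels, the `User` and `DataSet` classes and the sample users are
hypothetical names introduced for illustration and do not come from any
particular warehouse product.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, ordered from least to most restricted.
SENSITIVITY = {"general": 0, "restricted": 1, "highly_restricted": 2}


@dataclass(frozen=True)
class User:
    name: str
    department: str  # e.g. "sales" or "marketing"
    clearance: str   # highest sensitivity level the user's role may read


@dataclass(frozen=True)
class DataSet:
    name: str
    sensitivity: str
    departments: frozenset  # job functions allowed to view it; empty means all


def can_read(user: User, data: DataSet) -> bool:
    """Grant access only if the sensitivity and job-function checks both pass."""
    cleared = SENSITIVITY[user.clearance] >= SENSITIVITY[data.sensitivity]
    in_scope = not data.departments or user.department in data.departments
    return cleared and in_scope


summary = DataSet("summarized_sales", "general", frozenset())
detail = DataSet("detailed_sales", "restricted", frozenset({"sales"}))
analyst = User("analyst", "sales", "restricted")
marketer = User("marketer", "marketing", "general")

for user in (analyst, marketer):
    for data in (summary, detail):
        print(user.name, data.name, can_read(user, data))
```

Keeping the sensitivity check and the job-function check separate mirrors the
text: data is classified first, and users are then classified by the data they
may access.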


Q.28. Write short notes on the following -
(i) Audit requirements (ii) Network requirements.


Ans. (i) Audit Requirements - Auditing is a subset of security and a costly
activity. Auditing can cause heavy overheads on the system. To complete an
audit in time, we require more hardware and therefore, it is recommended
that wherever possible, auditing should be switched off. Audit requirements
can be categorized as follows:

(a) Connections (b) Disconnections
(c) Data access (d) Data change.

For each of the above-mentioned categories, it is necessary to audit
success, failure, or both. From the perspective of security, the auditing
of failures is very important, because failures can highlight unauthorized
or fraudulent access.


(ii) Network Requirements - Network security is as important as
other securities. We cannot ignore the network security requirement. We need
to consider the following issues:

(a) Is it necessary to encrypt data before transferring it to the
data warehouse ?
(b) Are there restrictions on which network routes the data can
take ?

These restrictions need to be considered carefully. Following are the points
to remember:

(a) The process of encryption and decryption will increase
overheads. It would require more processing power and processing time.
(b) The cost of encryption can be high if the source system is already
heavily loaded, because the encryption is borne by the source system.


Q.29. Define the terms -
(i) Data movement (ii) Documentation.

Ans. (i) Data Movement - There exist potential security implications
while moving the data. Suppose we need to transfer some restricted data as a
flat file to be loaded. When the data is loaded into the data warehouse, the
following questions are raised:

(a) Where is the flat file stored ?
(b) Who has access to that disk space ?

If we talk about the backup of these flat files, the following questions are
raised:

(a) Do you backup encrypted or decrypted versions ?
(b) Do these backups need to be made to special tapes that are
stored separately ?
(c) Who has access to these tapes ?

Some other forms of data movement like query result sets also need to be
considered. The questions raised while creating the temporary table are as
follows:

(a) Where is that temporary table to be held ?
(b) How do you make such table visible ?

We should avoid the accidental flouting of security restrictions. If a user
with access to the restricted data can generate accessible temporary tables,
data can be visible to non-authorized users. We can overcome this problem
by having a separate temporary area for users with access to restricted data.

(ii) Documentation - The audit and security requirements need to
be properly documented. This will be treated as a part of justification. This
document can contain all the information gathered from:

(a) Data classification
(b) User classification
(c) Network requirements
(d) Data movement and storage requirements
(e) All auditable actions.


Q.30. Describe the impact of security on design in data warehouse.

Ans. Security affects the application code and the development timescales.
Security affects the following areas:

(i) Application Development - Security affects the overall application
development and it also affects the design of the important components of the
data warehouse such as load manager, warehouse manager, and query manager.
The load manager may require checking code to filter records and place them
in different locations. More transformation rules may also be required to hide
certain data. Also there may be requirements of extra metadata to handle any
extra objects.

To create and maintain extra views, the warehouse manager may require
extra code to enforce security. Extra checks may have to be coded into the
data warehouse to prevent it from being fooled into moving data into a
location where it should not be available. The query manager requires changes
to handle any access restrictions. The query manager will need to be aware of
all extra views and aggregations.

(ii) Database Design - The database layout is also affected because
when security measures are implemented, there is an increase in the number
of views and tables. Adding security increases the size of the database and
hence increases the complexity of the database design and management. It
will also add complexity to the backup management and recovery plan.

(iii) Testing - Testing the data warehouse is a complex and lengthy
process. Adding security to the data warehouse also affects the testing time
and complexity. It affects the testing in the following two ways:

(a) It will increase the time required for integration and system
testing.
(b) There is added functionality to be tested which will increase
the size of the testing suite.
Q.31. Explain in brief about backup terminologies.


Ans. A data warehouse is a complex system and it contains a huge volume
of data. Therefore it is important to back up all the data so that it becomes
available for recovery in future as per requirement.
Some of the backup terminologies are given as follows:

(i) Complete Backup - It backs up the entire database at the same
time. This backup includes all the database files, control files, and journal
files.

(ii) Partial Backup - As the name suggests, it does not create a
complete backup of the database. Partial backup is very useful in large
databases because it allows a strategy whereby various parts of the database
are backed up in a round-robin fashion on a day-to-day basis, so that the
whole database is backed up effectively once a week.

(iii) Cold Backup - Cold backup is taken while the database is
completely shut down. In a multi-instance environment, all the instances should
be shut down.

(iv) Hot Backup - Hot backup is taken when the database engine is up
and running. The requirements of hot backup vary from RDBMS to RDBMS.

(v) Online Backup - It is quite similar to hot backup.
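
The round-robin strategy described under partial backup can be sketched as
follows. This is an illustrative sketch only: the ten part names and the
function `round_robin_schedule` are hypothetical, and a real schedule would
also weigh the size of each part.

```python
def round_robin_schedule(parts, days):
    """Assign each database part to one day, cycling through the days in order."""
    schedule = {day: [] for day in days}
    for index, part in enumerate(parts):
        schedule[days[index % len(days)]].append(part)
    return schedule


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Ten hypothetical parts of the database (for example tablespaces).
parts = [f"tablespace_{n:02d}" for n in range(1, 11)]

schedule = round_robin_schedule(parts, days)
for day in days:
    print(day, schedule[day])
```

Every part appears on exactly one day, so the whole database is covered once
per week while each nightly backup stays small.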
Q.32. Discuss the various hardware backup technologies used in a data
warehouse.


Ans. It is important to decide which hardware to use for the backup. The
speed of processing the backup and restore depends on the hardware being
used, how the hardware is connected, bandwidth of the network, backup
software, and the speed of server's I/O system. Here we will discuss some of
the hardware choices that are as follows:

(i) Tape Technology - There are a large number of tape backup options
in the market. The tape choice can be categorized as follows:

(a) Tape Media - There exist several varieties of tape media.
Some tape media standards are listed in table 2.3.


Table 2.3 Tape Media

Tape Media    Capacity
DLT           40 GB
3490e         1.6 GB
8 mm          14 GB


Other factors that need to be considered are as follows:
(1) Reliability of the tape medium
(2) Cost of tape medium per unit
(3) Scalability
(4) Cost of upgrades to tape system
(5) Shelf life of tape medium.

(b) Tape Drives - The tape drives can be connected in the
following ways:
(1) Direct to the server
(2) As network available devices
(3) Remotely to other machine.

There could be issues in connecting the tape drives to a data warehouse:

(1) Consider that the server is a 48 node MPP machine. We do
not know the node to connect the tape drive and we do not know how to
spread them over the server nodes to get the optimal performance with least
disruption of the server and least internal I/O latency.

(2) Connecting the tape drive as a network available device
requires the network to be up to the job of the huge data transfer rates. Make
sure that sufficient bandwidth is available during the time you require it.

(3) Connecting the tape drives remotely also requires high
bandwidth.


(c) Tape Stackers - The method of loading multiple tapes into
a single tape drive is known as tape stackers. The stacker dismounts the
current tape when it has finished with it and loads the next tape, hence only
one tape is available at a time to be accessed. The price and the capabilities
may vary, but the common ability is that they can perform unattended backups.

(d) Tape Silos - Tape silos can store and manage thousands of
tapes. They have the software and hardware to manage the tapes they store.
It is very common for the silo to be connected remotely over a network, so we
should ensure that the bandwidth of the connection is up to the job.

(ii) Disk Backups - Methods of disk backups are:
(a) Disk-to-disk backups
(b) Mirror breaking.

These methods are used in the OLTP system. These methods minimize
the database downtime and maximize the availability.

(a) Disk-to-disk Backups - Here backup is taken on the disk
rather than on the tape. Disk-to-disk backups are done for the following
reasons:
(1) Speed of initial backups
(2) Speed of restore.

Backing up the data from disk to disk is much faster than to the tape.
However it is the intermediate step of backup. Later the data is backed up on
the tape. The other advantage of disk-to-disk backups is that it gives you an
online copy of the latest backup.

(b) Mirror Breaking - The idea is to have disks mirrored for
resilience during the working day. When backup is required, one of the mirror
sets can be broken out. This technique is a variant of disk-to-disk backups.
The database may need to be shut down to guarantee consistency of the
backup.


Q.33. Write short note on optical jukeboxes.

Ans. Optical jukeboxes allow the data to be stored near line. This
technique allows a large number of optical disks to be managed in the same
way as a tape stacker or a tape silo. The drawback of this technique is that it
has slower write speed than disks. But the optical media provides long life and
reliability that makes them a good choice of medium for archiving.


Q.34. Explain in brief about the software backup.

Ans. There are software tools available that help in the backup process.
These software tools come as a package. These tools not only take backups,
they can also effectively manage and control the backup strategies. There are
many software packages available in the market. Some of them are listed in
the following table.


Table 2.4 Software Backup Packages

Package Name    Vendor
Networker       Legato
ADSM            IBM
Epoch           Epoch Systems
Omniback II     HP
Alexandria      Sequent


The criteria for choosing the best software package are as follows:

(i) How scalable is the product as tape drives are added ?
(ii) Does the package have client-server option, or must it run on the
database server itself ?
(iii) Will it work in cluster and MPP environments ?
(iv) What degree of parallelism is required ?
(v) What platforms are supported by the package ?
(vi) Does the package support easy access to information about tape
contents ?
(vii) Is the package database aware ?
(viii) What tape drive and tape media are supported by the package ?


Q.35. Write short note on recovery strategies.


Ans. The recovery strategy will be built around the backup strategy. Any
recovery situation naturally implies that some failure has occurred. The
recovery action will depend on what that failure was. As such, a recovery
strategy will consist of a set of failure scenarios and their resolution. Whatever
software you choose, the recovery steps for the failure scenarios below need
to be fully documented:

(i) Media failure
(ii) Damage or loss of a table
(iii) Damage or loss of a redo log file
(iv) Damage or loss of tablespace or data file
(v) Damage or loss of control file
(vi) Damage or loss of archive log file
(vii) Instance failure
(viii) Failure during data movement
(ix) Other scenarios.

The recovery plan for each of the scenarios should take into account the
use of any operations that are not logged.


INTRODUCTION TO DATA
& DATA MINING


INTRODUCTION TO DATA & DATA MINING -
DATA TYPES, QUALITY OF DATA, DATA PREPROCESSING, SIMILARITY
MEASURES, SUMMARY STATISTICS, DATA DISTRIBUTIONS


Q.1. Define the term 'data and data mining'.


Ans. Any symbolic representation of facts or ideas from which information
can potentially be extracted is called data.

The term data mining means extracting or mining knowledge from huge
amounts of data. Strictly, the term is a misnomer: the mining of gold from
rocks is referred to as gold mining, not rock mining, so data mining should
be called knowledge mining from data, but that name is very long. The
shorter term knowledge mining may not reflect the emphasis on mining from
huge amounts of data. However, mining characterizes a process that
discovers precious nuggets from raw material. Therefore, a term that is not
strictly appropriate but carries both "data" and "mining" became a popular
choice.


Q.2. Write short note on data types in data mining.


Ans. Most data mining systems that are available on the market handle
formatted, record based, relational like data with numerical, categorical and
symbolic attributes. The data could be in the form of ASCII text, relational
database data, or data warehouse data. It is important to check what exact
format(s) each system you are considering can handle. Some kinds of data or
applications may require specialized algorithms to search for patterns, and
their requirements may not be handled by off-the-shelf, generic data mining
systems. Instead, specialized data mining systems may be used, which mine
text documents, geospatial data, multimedia data, stream data, time series
data, biological data, or Web data, or are dedicated to specific applications
(for instance, the retail industry or telecommunications). Moreover, many
data mining companies offer customized data mining solutions that incorporate
essential data mining functions or methodologies.


Q.3. Briefly, write short note on the quality of data.


Ans. The existence of data alone does not ensure that all the management
functions and decisions can be smoothly undertaken. One definition of
data quality is that it's about bad data - data that is missing or incorrect or
invalid in some context. A broader definition is that data quality is achieved
when an organization uses data that is comprehensive, understandable,
consistent, relevant and timely. Understanding the key data quality dimensions
is the first step to data quality improvement. To be processable and
interpretable in an effective and efficient manner, data has to satisfy a set of
quality criteria. Data satisfying those quality criteria is said to be of high
quality. Abundant attempts have been made to define data quality and to
identify its dimensions. Dimensions of data quality typically include accuracy,
reliability, importance, consistency, precision, timeliness, fineness,
understandability, conciseness and usefulness. We have undertaken the quality
criteria by taking 6 key dimensions as depicted below in fig. 3.1.


Fig. 3.1 Data Quality


(i) Completeness - Is all the requisite information available ? Are
some data values missing, or in an unusable state ?

(ii) Consistency - Do distinct occurrences of the same data instances
agree with each other or provide conflicting information ? Are values consistent
across data sets ?

(iii) Validity - Refers to the correctness and reasonableness of data.

(iv) Conformity - Are there expectations that data values conform to
specified formats ? If so, do all the values conform to those formats ?
Maintaining conformance to specific formats is important.

(v) Accuracy - Do data objects accurately represent the "real world"
values they are expected to model ? Incorrect spellings of product or person
names, addresses, and even untimely or not current data can impact operational
and analytical applications.

(vi) Integrity - What data is missing important relationship linkages ?
The inability to link related records together may actually introduce duplication
across your systems.
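
Two of the dimensions above, completeness and conformity, lend themselves to
simple measurement. The sketch below is illustrative only: the sample records,
the assumed phone format and the function names are all hypothetical.

```python
import re

# Hypothetical customer records with two deliberate quality problems.
records = [
    {"id": 1, "name": "Asha", "phone": "555-0101"},
    {"id": 2, "name": "", "phone": "555-0102"},      # missing name
    {"id": 3, "name": "Ravi", "phone": "5550103"},   # phone not in NNN-NNNN form
]

PHONE_FORMAT = re.compile(r"^\d{3}-\d{4}$")  # assumed required format


def completeness(rows, field):
    """Fraction of rows in which the field holds a non-empty value."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def conformity(rows, field, pattern):
    """Fraction of rows in which the field matches the required format."""
    ok = sum(1 for row in rows if pattern.match(str(row.get(field, ""))))
    return ok / len(rows)


print("name completeness:", completeness(records, "name"))
print("phone conformity:", conformity(records, "phone", PHONE_FORMAT))
```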


Q.4. What is data preprocessing ?

Ans. Refer to Q.21, Unit-I.

Q.5. Write short note on similarity measures.

Ans. A variety of different similarity measures can be used to calculate
the similarity between the item and the search statement. A characteristic of a
similarity formula is that the results of the formula increase as the items
become more similar. The value is zero if the items are totally dissimilar. An
example of a simple "sum of the products" similarity measure to determine the
similarity between documents for clustering purposes is:

$$\mathrm{SIM}(\mathrm{Item}_i, \mathrm{Item}_j) = \sum_{k} \mathrm{Term}_{i,k} \cdot \mathrm{Term}_{j,k}$$

This formula uses the summation of the product of the various terms of
two items when treating the index as a vector. If Item_j is replaced with
Query_j, then the same formula generates the similarity between every item and
Query_j. The problem with this simple measure is in the normalization needed
to account for variances in the length of items. Additional normalization is
also used to have the final results come between zero and +1 (some formulas
use the range between -1 and +1).
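
The length problem mentioned above can be seen with a small sketch of the
sum-of-products measure. The vectors are made-up term weights chosen for
illustration.

```python
def sum_of_products(item, other):
    """Unnormalized similarity: sum over terms of the product of the weights."""
    return sum(a * b for a, b in zip(item, other))


item_a = [1, 2, 0]
item_b = [2, 4, 0]  # same term proportions as item_a, but twice the weight
query = [1, 1, 0]

print(sum_of_products(item_a, query))  # 3
print(sum_of_products(item_b, query))  # 6
```

Both items use the terms in the same proportions, yet the longer item scores
twice as high, which is why normalization is needed.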


The statistical indexing and similarity functions model of Robertson and
Sparck Jones suggests that knowledge of terms in relevant items retrieved
from a query should adjust the weights of those terms in the weighting
process. They used the number of relevant documents versus the number of
non-relevant documents in the database, and the number of relevant documents
having a specific query term versus the number of non-relevant documents
having that term, to devise four formulas for weighting. This assumption of
the availability of relevance information in the weighting process was later
relaxed by Croft and Harper. Croft expanded this original concept, taking into
account the frequency of occurrence of terms within an item, producing the
following similarity formula:

$$\mathrm{SIM}(\mathrm{DOC}_j, \mathrm{QUERY}) = \sum_{i=1}^{Q} (C + \mathrm{IDF}_i) \cdot f_{i,j}$$

where C = Constant used in tuning
IDF_i = Inverse document frequency for term "i" in the collection

$$f_{i,j} = K + (1 - K)\,\frac{\mathrm{TF}_{i,j}}{\mathrm{maxfreq}_j}$$

K = Tuning constant
TF_{i,j} = Frequency of term "i" in item "j"
maxfreq_j = Maximum frequency of any term in item "j"
Q = Number of terms in the query.

The best values for K seemed to range between 0.3 and 0.5.
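
As a rough sketch of how this weighting could be computed, the function below
evaluates the formula for one item. It rests on assumptions that are not in the
text: term frequencies and IDF values are held in plain dictionaries, only
query terms that actually occur in the item contribute to the sum, and the
numeric values are invented.

```python
def croft_similarity(query_terms, item_tf, idf, C=1.0, K=0.3):
    """Sum (C + IDF_i) * f_ij over the query terms present in the item."""
    max_freq = max(item_tf.values())
    score = 0.0
    for term in query_terms:
        tf = item_tf.get(term, 0)
        if tf == 0:
            continue  # assumption: absent terms add nothing to the score
        f = K + (1 - K) * tf / max_freq
        score += (C + idf[term]) * f
    return score


# Hypothetical term frequencies for one item and IDF values for the collection.
item_tf = {"data": 4, "mining": 2}
idf = {"data": 1.0, "mining": 2.0, "fuzzy": 3.0}

print(croft_similarity(["data", "mining", "fuzzy"], item_tf, idf, C=1.0, K=0.5))
```

With K = 0.5 the term "data" gets f = 1.0 and "mining" gets f = 0.75, so the
score is 2 * 1.0 + 3 * 0.75.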


Q.6. ... techniques.

Ans. Searching in general is concerned with calculating the similarity
between a user's search statement and the items in the database. Although
many of the older systems are unweighted, the newer classes of information
retrieval systems have logically stored weighted values for the indexes to an
item. The similarity may be applied to the total item or constrained to logical
passages in the item. For example, every paragraph may be defined as a passage
or every 100 words.

In this case, the similarity will be to the passages versus the total item.
Rather than limiting the definition of a passage to a fixed length size,
locality based similarity allows variable length passages (neighbourhoods)
based upon similarity of content. In results presented at TREC-4, it was
discovered that passage retrieval makes a significant difference when search
statements are long (hundreds of terms) but does not make a major difference
for short queries. The lack of a large number of terms makes it harder to find
shorter passages that contain the search terms expanded from the shorter
queries.

Once items are identified as possibly relevant to the user's query, it is best
to present the most likely relevant items first. This process is called "ranking".
Usually the output of the use of a similarity measure in the search process is a
scalar number that represents how similar an item is to the query.


Similarity Measure based on Cosine Formula - The most common
similarity formula was used by Salton in the SMART system. Salton treated
the index and the search query as n dimensional vectors. To determine the
"weight" an item has with respect to the search statement, the Cosine formula
is used to calculate the distance between the vector for the item and the vector
for the query:

$$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{QUERY}_j) = \frac{\sum_{k=1}^{n} \mathrm{DOC}_{i,k} \cdot \mathrm{QTERM}_{j,k}}{\sqrt{\sum_{k=1}^{n} \mathrm{DOC}_{i,k}^{2} \cdot \sum_{k=1}^{n} \mathrm{QTERM}_{j,k}^{2}}}$$

where DOC_{i,k} is the kth term in the weighted vector for item "i" and
QTERM_{j,k} is the kth term in query "j". The Cosine formula calculates the
cosine of the angle between the two vectors. As the Cosine approaches "1", the
two vectors become coincident (i.e., the item and the query represent the same
concept). If the two are totally unrelated, then they will be orthogonal and
the value of the Cosine is "0". It does not take into account the length of the
vectors. For example, if the following vectors are in a three dimensional
(three term) system:

Item = (4, 8, 0)
Query 1 = (1, 2, 0)
Query 2 = (3, 6, 0)

then the Cosine value is identical for both queries even though Query 2 has
significantly higher weights in the terms in common. To improve the formula,
Salton and Buckley changed the term factors in the query to:

$$\mathrm{QTERM}_{i,k} = \left(0.5 + 0.5\,\frac{\mathrm{TF}_{i,k}}{\mathrm{maxfreq}_k}\right) \cdot \mathrm{IDF}_i$$

where TF_{i,k} is the frequency of term "i" in query "k", maxfreq_k is the
maximum frequency of any term in query "k" and IDF_i is the inverse document
frequency for term "i". In the most recent evolution of the formula, the IDF
factor has been dropped.
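
The three-term example can be checked with a short sketch of the Cosine
formula. The function names are illustrative; the vectors are the ones given in
the example.

```python
import math


def cosine(doc, query):
    """Cosine of the angle between two weighted term vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc) * sum(q * q for q in query))
    return dot / norm if norm else 0.0


def query_term_weight(tf, max_freq, idf):
    """Salton and Buckley query term factor: (0.5 + 0.5 * TF / maxfreq) * IDF."""
    return (0.5 + 0.5 * tf / max_freq) * idf


item = (4, 8, 0)
query_1 = (1, 2, 0)
query_2 = (3, 6, 0)

print(cosine(item, query_1), cosine(item, query_2))
```

Both queries point in the same direction as the item, so the two cosine values
agree even though Query 2 carries three times the weight of Query 1.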


Jaccard and the Dice Similarity Measure - Two other commonly used
measures are the Jaccard and the Dice similarity measures. Both change the
normalizing factor in the denominator to account for different characteristics
of the data. The denominator in the Cosine formula is invariant to the number
of terms in common and produces very small numbers when the vectors are
large and the number of common elements is small. In the Jaccard similarity
measure, the denominator becomes dependent upon the number of terms in
common. As the common elements increase, the similarity value quickly
decreases, but is always in the range of -1 to +1.
The Jaccard formula is:

$$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{QUERY}_j) = \frac{\sum_{k=1}^{n} \mathrm{DOC}_{i,k} \cdot \mathrm{QTERM}_{j,k}}{\sum_{k=1}^{n} \mathrm{DOC}_{i,k} + \sum_{k=1}^{n} \mathrm{QTERM}_{j,k} - \sum_{k=1}^{n} \mathrm{DOC}_{i,k} \cdot \mathrm{QTERM}_{j,k}}$$

The Dice measure simplifies the denominator from the Jaccard measure
and introduces a factor of 2 in the numerator. The normalization in the Dice
formula is also invariant to the number of terms in common.

$$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{QUERY}_j) = \frac{2 \sum_{k=1}^{n} \mathrm{DOC}_{i,k} \cdot \mathrm{QTERM}_{j,k}}{\sum_{k=1}^{n} \mathrm{DOC}_{i,k} + \sum_{k=1}^{n} \mathrm{QTERM}_{j,k}}$$
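
A small sketch of the two measures follows. It assumes binary term weights
(1 if the term is present, 0 otherwise), which is an assumption made for the
example and not a requirement stated in the text; the vectors are invented.

```python
def jaccard(doc, query):
    """Common weight divided by (doc total + query total - common weight)."""
    common = sum(d * q for d, q in zip(doc, query))
    denominator = sum(doc) + sum(query) - common
    return common / denominator if denominator else 0.0


def dice(doc, query):
    """Twice the common weight divided by (doc total + query total)."""
    common = sum(d * q for d, q in zip(doc, query))
    denominator = sum(doc) + sum(query)
    return 2 * common / denominator if denominator else 0.0


# Binary term vectors: 1 means the term occurs, 0 means it does not.
doc = (1, 1, 0, 1)
query = (1, 0, 0, 1)

print(jaccard(doc, query))  # 2 / (3 + 2 - 2)
print(dice(doc, query))     # 2 * 2 / (3 + 2)
```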


Ans. Statistics is a component of data mining that provides the tools and
techniques for dealing with large amounts of data. It is the science of
learning from data and includes everything from collecting and organizing to
analyzing and presenting data. Statistics focuses on probabilistic models,
specifically inference, using data.

While the aims of statistics and data mining are similar, it is estimated
that there are very few statisticians to deal with the demands of data analysis.

The two types of statistics prevalent are descriptive and inferential.
Descriptive statistics organize and summarize the data for the sample. The
methodology of using these summaries to draw conclusions about entire data
sets is called inferential statistics.

Both data mining and statistics are related to learning from data. They are
all about discovering and identifying structures in data, with the aim of
turning data into information. And although the aims of both these techniques
overlap, they have different approaches.

Statistics is only about quantifying data. While it uses tools to find relevant
properties of data, it is a lot like math. It provides the tools necessary for data
mining. Data mining, on the other hand, builds models to detect patterns and
relationships in data, particularly from large databases.

Statistics and Machine Learning - Along with the development of
statistics and machine learning, there is a continuum between these two
subjects. Statistical tests are used to validate the machine learning models
and to evaluate machine learning algorithms. Machine learning techniques are
incorporated with standard statistical techniques.

Statistics and R - R is a statistical programming language. It provides a
huge number of statistical functions, which are based on the knowledge of
statistics. Many R add-on package contributors come from the field of statistics
and use R in their research.
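
The difference between descriptive and inferential statistics can be sketched
with Python's standard `statistics` module. The sample values are invented, and
the interval uses the normal approximation with the factor 1.96, which is an
assumption made only for this illustration.

```python
import math
import statistics

sample = [12, 15, 11, 19, 15, 18]  # hypothetical sample of daily sales counts

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
median = statistics.median(sample)
variance = statistics.variance(sample)  # sample variance (n - 1 denominator)
stdev = statistics.stdev(sample)

# Inferential statistics: use the summaries to say something about the
# population, here an approximate 95% interval for the population mean.
standard_error = stdev / math.sqrt(len(sample))
interval = (mean - 1.96 * standard_error, mean + 1.96 * standard_error)

print(mean, median, variance)
print(interval)
```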

Q.8. Discuss the concept of data distribution algorithm.


Ans. The data distribution algorithm (DDA) demonstrates task parallelism.
Here the candidates as well as the database are partitioned among the
processors. Each processor in parallel counts the candidates given to it using
its local database partition. Following our convention, we use C_k^l to indicate
the candidates of size k examined at processor X^l. Also, L_k^l are the local
large k-itemsets at processor l. Then each processor broadcasts its database
partition to all other processors. Each processor then uses this to obtain a
global count for its data and broadcasts this count to all other processors.
Each processor then can determine globally large itemsets and generate the
next candidates. These candidates then are divided among the processors for
the next scan. In the data distribution algorithm, we show that the candidates
are actually sent to each processor. However, some prearranged technique could
be used locally by each processor to determine its own candidates. This
algorithm suffers from high message traffic whose impact can be reduced by
overlapping communication and processing.

Input:
    I                          //Itemsets
    X^1, X^2, ..., X^p         //Processors
    Y = Y^1, Y^2, ..., Y^p     //Database divided into partitions
    s                          //Support
Output:
    L                          //Large itemsets

Data distribution algorithm:
    C_1 = I;
    for each 1 <= l <= p do    //Distribute size 1 candidates to each processor.
        determine C_1^l and distribute to X^l;
    perform in parallel at each processor X^l;    //Count in parallel.
        k = 0;                 //k is used as the scan number.
        L = ∅;
        repeat
            k = k + 1;
            L_k^l = ∅;
            for each I_i ∈ C_k^l do
                c_i^l = 0;     //Initial counts for each itemset are 0.
            for each t_j ∈ Y^l do
                for each I_i ∈ C_k^l do
                    if I_i ∈ t_j then
                        c_i^l = c_i^l + 1;    //Determine local counts.
            broadcast Y^l to all other processors;
            for every other processor m ≠ l do
                for each t_j ∈ Y^m do
                    for each I_i ∈ C_k^l do
                        if I_i ∈ t_j then
                            c_i^l = c_i^l + 1;    //Determine global counts.
            for each I_i ∈ C_k^l do
                if c_i^l ≥ (s × |Y^1 ∪ Y^2 ∪ ... ∪ Y^p|) do
                    L_k^l = L_k^l ∪ I_i;
            broadcast L_k^l to all other processors;
            L_k = L_k^1 ∪ L_k^2 ∪ ... ∪ L_k^p;    //Global large k-itemsets.
            C_{k+1} = Apriori-gen(L_k);
            C_{k+1}^l ⊂ C_{k+1};    //Determine next set of local candidates.
        until C_{k+1}^l = ∅;

Fig. 3.2 illustrates the approach used by the DDA algorithm using the
grocery store data. Here there are three processors. X^1 is counting Beer and
Bread, X^2 is counting Jelly and Milk, and X^3 is counting PeanutButter. The
first two transactions initially are counted at X^1, the next two at X^2, and the
last one at X^3. When the local counts are obtained, the database partitions are
then broadcast to the other sites so that each site can obtain a global count.


[Figure: local counts at each processor. X^1: Beer 0, Bread 1. X^2: Jelly 0,
Milk 1. X^3: PeanutButter 0.]

Fig. 3.2 Data Distribution
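
The flow of the algorithm can be imitated in a single process. The sketch below
is only an approximation under stated assumptions: the "processors" are loop
iterations rather than real parallel workers, the broadcast is modelled by
letting each processor read every partition, candidates are dealt out round
robin, and the five transactions are assumed for illustration (they are not
listed in the text above).

```python
from itertools import combinations


def apriori_gen(large, k):
    """Build (k+1)-candidates whose k-item subsets are all large."""
    items = sorted({item for itemset in large for item in itemset})
    return {
        frozenset(c)
        for c in combinations(items, k + 1)
        if all(frozenset(sub) in large for sub in combinations(c, k))
    }


def data_distribution(partitions, support):
    """Simulate DDA: split the candidates, count each on every partition."""
    p = len(partitions)
    total = sum(len(part) for part in partitions)
    items = sorted({i for part in partitions for t in part for i in t})
    candidates = {frozenset([i]) for i in items}
    large_all, k = set(), 1
    while candidates:
        ordered = sorted(candidates, key=sorted)
        local = [ordered[l::p] for l in range(p)]  # candidates of processor l
        large_k = set()
        for l in range(p):
            counts = {c: 0 for c in local[l]}
            # Own partition first, then the partitions "broadcast" by the others.
            for part in [partitions[l]] + partitions[:l] + partitions[l + 1:]:
                for t in part:
                    for c in local[l]:
                        if c <= t:
                            counts[c] += 1
            large_k |= {c for c, n in counts.items() if n >= support * total}
        large_all |= large_k  # union of the local large itemsets
        candidates = apriori_gen(large_k, k)
        k += 1
    return large_all


# Assumed grocery transactions, split 2 / 2 / 1 over three processors.
partitions = [
    [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}],
    [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}],
    [{"Beer", "Milk"}],
]
result = data_distribution(partitions, 0.4)
print(sorted(sorted(itemset) for itemset in result))
```

Because every processor must see every partition, the inner loop touches the
whole database p times, which reflects the heavy message traffic noted in the
answer.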


BASIC DATA MINING TASKS, DATA MINING v/s
KNOWLEDGE DISCOVERY IN DATABASES,
ISSUES IN DATA MINING


Q.9. Describe the architectural view of data mining system. Also explain each component briefly.
Or
Explain major components of a data mining system architecture.
(R.G.P.V., Dec. 2003)
Or
Explain in brief the major components of a typical data mining system architecture. (R.G.P.V., June 2009)
Or
What is use of knowledge base ? Write the architecture of typical data mining system. (R.G.P.V., June 2012)

Ans. The architecture of a typical data mining system is shown in fig. 3.3. This architecture incorporates a database and/or a data warehouse and their appropriate servers, a data mining engine, a pattern evaluation module, a graphical user interface, and a knowledge base. Data mining components with a database or data warehouse system can use either no coupling, loose coupling, semitight coupling, or tight coupling. The major components are as follows –

(i) Database, Data Warehouse, World Wide Web, or other Information Repository – It can be a single or a collection of databases, data warehouses, spreadsheets, or other types of information repositories. Data cleaning, integration and selection operations are applied on data, if required.

[Figure: architecture diagram. Recoverable labels are Pattern Evaluation; Database or Data Warehouse Server; Data Cleaning, Integration and Selection.]

Fig. 3.3 Data Mining System Architecture


(ii) Database or Data Warehouse Server – On the basis of the user's data mining request, the related data is fetched by the database or data warehouse server.

(iii) Knowledge Base – Knowledge base is the domain knowledge. Domain knowledge guides the search or determines the interestingness of resulting patterns. Such knowledge can incorporate concept hierarchies which are employed to arrange attributes or attribute values into various levels of abstraction. Knowledge such as user beliefs may also be included.

(iv) Data Mining Engine – It is composed of a set of functional modules for tasks like characterization, prediction, classification, cluster analysis, outlier analysis, evolution analysis, and association and correlation analysis. Data mining engine is essential to the data mining system.

(v) Pattern Evaluation Module – It uses interestingness measures and communicates with the data mining modules to concentrate the search toward interesting patterns. It may use interestingness thresholds to filter out
Q.14. Write short note on practical applications of data mining. (R.G.P.V.)
Or
Write short note on data mining applications. (R.G.P.V., Dec.)
Or
Discuss data mining application areas in detail. (R.G.P.V., June)

Ans. Some of the applications of data mining are as follows –

(i) Data Mining for the Retail Industry – For data mining, the main application area is the retail industry, because it gathers large amounts of data on sales, customer shopping history, consumption, goods transportation, and service. The gathering of data continues to expand rapidly, especially due to the increasing ease, availability, and popularity of business conducted on the Web, or e-commerce. At present, there are many stores that have web sites where customers can make purchases on-line.

Retail data mining can help recognize customer purchasing behaviours, find customer shopping patterns and trends, obtain better customer retention and satisfaction, enhance quality of customer service, improve goods consumption ratios, design more effective goods transportation policies and decrease the business cost.

(ii) Data Mining for Financial Data Analysis – Many banks and financial institutions provide a large number of banking services, credit, and investment services. Some also provide insurance services and stock investment services. Financial data gathered in the banking and financial industry are often relatively complete, reliable, and of high quality, which makes systematic data analysis and data mining easy. Some typical cases are –

(a) Loan payment prediction and customer credit policy analysis.

(b) Classification and clustering of customers for targeted marketing.

(c) Design and construction of data warehouses for multidimensional data analysis and data mining.

(d) Detection of money laundering and other financial crimes.

(iii) Data Mining for the Telecommunication Industry – The telecommunication industry has quickly evolved from offering local and long-distance telephone services to providing many other comprehensive communication services, such as pager, Internet messenger, fax, cellular phone, images, e-mail, computer and web data transmission, and other data traffic. The integration of telecommunication, Internet, computer network, and numerous other means of communication and computing is also underway. In addition, the telecommunication market is rapidly expanding and highly competitive, with the deregulation of the telecommunication industry in many countries and the development of new computer and communication technologies. This creates a great demand for data mining in order to help understand the business involved, recognize telecommunication patterns, make better use of resources, enhance the service quality and catch fraudulent activities.

(iv) Data Mining for Biological Data Analysis – The past decade has seen an increase in proteomics, genomics, functional genomics, and biomedical research. Biological data mining has become an important part of a new research field known as bioinformatics. It is not possible to cover such an important and flourishing theme in one subsection because the field of biological data mining is rich, broad and dynamic.

Recently, data collection and storage technologies have improved. That is why scientific data can today be accumulated at much higher speeds and lower costs. This leads to the accumulation of large amounts of high-dimensional data, stream data, and heterogeneous data, containing rich spatial and temporal information.
Q.15. List any four data mining applications. (R.G.P.V., June 2016)

Ans. Refer to Q.14.

Q.16. Discuss the application of data mining in the banking industry.
(R.G.P.V., Dec. 2011)
Ans. Refer to Q.14 (ii).

Q.17. What are the basic differences between OLAP and data mining.
Or
Compare and contrast the two basic ways to analyse the data in data warehouse. (R.G.P.V., June 2012)
Ans. Table 3.1 shows the basic differences between OLAP and data mining.

Table 3.1

S.No. | Characteristics | OLAP | Data Mining
(i) | Motivation for information request. | What is happening in the enterprise ? | Predict the future based on why this is happening.
(ii) | Sizes of datasets for the dimensions. | Not large for each dimension. | Usually very large for each dimension.
(iii) | Number of dimension attributes. | Small number of attributes. | Many dimension attributes.
(iv) | Number of business dimensions. | Limited number of dimensions. | Large number of dimensions.
(v) | Analysis approach. | User-driven, interactive analysis. | Data-driven automatic knowledge discovery.
(vi) | Analysis techniques. | Multidimensional, drill-down, and slice-and-dice. | Prepare data, launch mining tool and sit back.
(vii) | State of the technology. | Mature and widely used. | Still emerging; some parts of the technology more mature.
(viii) | Data granularity. | Summary data. | Detailed transaction-level data.


Q.18. Describe the various data mining techniques.

Ans. The two fundamental objectives of data mining are prediction and description. Prediction makes use of available variables in the database in order to predict unknown or future values of interest, and description emphasizes finding patterns describing the data and the subsequent presentation for user interpretation. The relative emphasis of both prediction and description differs with respect to the underlying application and the technique. There are various data mining techniques satisfying these objectives. Some of these are associations, classifications, sequential patterns and clustering. The basic premise of an association is to determine all associations, such that the presence of one set of items in a transaction implies the other items. Classification develops profiles of different groups. Sequential patterns identify sequential patterns related to a user-specified minimum constraint. Clustering divides a database into subsets or clusters. Another approach to the study of DM techniques is to classify the techniques as –
(i) Discovery-driven or automatic discovery of rules
(ii) User-guided or verification-driven data mining.
Most of the techniques of data mining have elements of both the models.


(i) Verification Model – In this process of data mining, the user makes a hypothesis and tests the hypothesis on the data to verify its validity. The focus is on the user who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate the hypothesis.

In a supermarket, for example, with a limited budget for a mailing campaign to launch a new product, it is significant to recognize the section of the population most likely to buy the new product. The user formulates a hypothesis to identify potential customers and their common features. Historical data about transactions and demographic information can then be queried to reveal comparable purchases and the features shared by those purchasers. The whole operation can be repeated by successive refinements of the hypothesis until the required limit is reached. The user may come up with a new hypothesis or may refine the existing one and verify it against the database.


(ii) Discovery Model – The discovery model differs in its emphasis, in that it is the system automatically discovering important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalizations about the data without interventions or guidance from the user. The manner in which the rules are found relies on the class of the data mining application.

An example of such a model is a supermarket database that is mined to find the particular groups of customers to target for a mailing campaign. The data is searched with no hypothesis in mind other than for the system to group the customers according to the common characteristics found. Following are the discovery driven tasks –
(a) Clustering
(b) Deviation detection
(c) Discovery of association rules
(d) Discovery of classification rules
(e) Discovery of frequent episodes.
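
The idea behind association rules, that the presence of one set of items in a transaction implies the other items, is usually quantified with support and confidence. The following is a minimal sketch of those two measures; the five baskets and the function names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical market baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
    {"milk", "bread", "jam"},
]

milk_bread_support = support({"milk", "bread"}, baskets)          # 3 of 5 baskets
milk_to_bread_confidence = confidence({"milk"}, {"bread"}, baskets)  # 3 of the 4 milk baskets
```

A discovery-driven tool would enumerate many candidate rules and keep those whose support and confidence exceed user-given thresholds; a verification-driven user would instead compute the measures for one rule chosen in advance.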
(iii) Neural Networks – Neural networks are a new paradigm in computing, which involves developing mathematical structures with the ability to learn. The methods are the result of academic attempts to model the learning of the nervous system. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too difficult to be noticed by either humans or other computer techniques. Neural networks have broad applicability to real world business problems and have already been successfully applied in many industries. As neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs.

Neural networks use a set of processing elements corresponding to neurons in the brain. These processing elements are interconnected in a network that can then recognize patterns in data once it is exposed to the data, i.e., the network learns from experience just as people do. This differentiates neural networks from traditional computing programs that simply follow instructions in a fixed sequential order.
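
Learning from experience can be illustrated with the simplest possible network, a single processing element (a perceptron) that adjusts its weights whenever it misclassifies an example. This is only a sketch under stated assumptions: the logical AND data set, the integer learning rate of 1, and the names `train_perceptron` and `predict` are chosen for the example and are not taken from the text.

```python
def predict(weights, bias, x):
    """Fire (1) when the weighted sum of the inputs exceeds zero."""
    return 1 if weights[0] * x[0] + weights[1] * x[1] + bias > 0 else 0

def train_perceptron(samples, epochs=20, rate=1):
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            # Weights change only when the output was wrong.
            weights[0] += rate * error * x[0]
            weights[1] += rate * error * x[1]
            bias += rate * error
    return weights, bias

# Assumed training data: the logical AND of two inputs.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
outputs = [predict(weights, bias, x) for x, _ in AND_DATA]
```

No rule for AND is programmed in; the weights that reproduce it emerge from repeated exposure to the examples, which is the contrast with fixed sequential instructions drawn above.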


(iv) Genetic Algorithms – Genetic algorithms are a relatively new computing paradigm, inspired by Darwin's theory of evolution. A population of individuals, each representing a possible solution to a problem, is initially created at random. Then pairs of individuals combine to form offspring for the next generation. An individual's chance of reproducing depends on the quality of the solution it represents. Thus, the quality of the solutions in successive generations improves.

The process is terminated when an acceptable or optimum solution is found, or after some fixed time period. Genetic algorithms are suitable for problems that need optimization with respect to some computable criterion. This paradigm can also be applied to data mining problems. Here, the quantity to be reduced is often the number of classification errors on a training set. However, the mining of large data sets by genetic algorithms has only recently become practical because of the availability of affordable, high-speed computers. But there is hardly any special genetic algorithm designed for data mining problems.
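
A minimal sketch of this loop is given below, assuming a toy problem: each individual is an 8-bit threshold, and the criterion to reduce is the number of classification errors on a small hypothetical training set. The data, the population size, the mutation rate and the function names are all illustrative assumptions. The best individual is copied unchanged into each new generation, so the error of the best solution never gets worse.

```python
import random

# Hypothetical training set: (value, class label).
TRAIN = [(12, 0), (40, 0), (75, 0), (90, 0), (130, 1), (160, 1), (200, 1), (240, 1)]

def errors(threshold):
    """Classification errors of the rule 'class 1 if value >= threshold'."""
    return sum((value >= threshold) != bool(label) for value, label in TRAIN)

def evolve(generations=30, size=12, seed=0):
    rng = random.Random(seed)
    population = [rng.randrange(256) for _ in range(size)]  # random start
    history = []
    for _ in range(generations):
        population.sort(key=errors)
        history.append(errors(population[0]))
        if history[-1] == 0:           # acceptable solution found
            break
        parents = population[: size // 2]   # better half may reproduce
        children = [population[0]]          # keep the best individual
        while len(children) < size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 8)
            mask = (1 << cut) - 1
            child = (a & ~mask) | (b & mask)      # one-point crossover on bits
            if rng.random() < 0.2:
                child ^= 1 << rng.randrange(8)    # mutation: flip one bit
            children.append(child)
        population = children
    return min(population, key=errors), history

best, history = evolve()
```

Here `history` records the error count of the best individual in each generation, which makes the improvement over successive generations visible.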
(v) Rough Sets Techniques – In the field of data mining, the rough sets theory has recently become a popular theory. The theory, introduced by Pawlak in the early 1980s, offers a formal framework for the automated transformation of data into knowledge. The rough sets theory, though mathematically simple, has displayed its fruitfulness in a number of data mining areas. A rough set is a pair of approximations – lower approximation and upper approximation sets. The lower approximation is also said to be the positive cases and the upper approximation is said to be the possible cases. Using these approximations, the rough set theory develops tools to discover rules from the given databases. A large number of applications utilize the ideas of the theory.
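
The two approximations can be computed directly from their definitions: objects that are indistinguishable on the chosen attributes form a block, the lower approximation collects the blocks lying entirely inside the target set, and the upper approximation collects the blocks that overlap it. The patient table, the attribute names and the function name `approximations` below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of `target`.

    objects: dict mapping object name -> dict of attribute values
    target:  set of object names belonging to the concept
    """
    blocks = defaultdict(set)
    for name, row in objects.items():
        # Objects with identical attribute values are indiscernible.
        blocks[tuple(row[a] for a in attributes)].add(name)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:   # certainly in the concept (positive cases)
            lower |= block
        if block & target:    # possibly in the concept (possible cases)
            upper |= block
    return lower, upper

# Hypothetical patient records.
patients = {
    "p1": {"headache": "yes", "temperature": "high"},
    "p2": {"headache": "yes", "temperature": "high"},
    "p3": {"headache": "yes", "temperature": "normal"},
    "p4": {"headache": "no", "temperature": "high"},
    "p5": {"headache": "no", "temperature": "high"},
    "p6": {"headache": "no", "temperature": "normal"},
}
flu = {"p1", "p2", "p4"}
lower, upper = approximations(patients, ["headache", "temperature"], flu)
```

In this data p4 and p5 have identical symptoms but different diagnoses, so they fall in the upper approximation only; a certain rule can be stated for the lower approximation and only a possible rule for the boundary.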


(vi) Support Vector Machines – Support vector machines (SVMs) depend on statistical learning theory and are increasingly becoming useful in data mining. The key concept is to nonlinearly map the data set into a high-dimensional feature space and use a linear discriminator to classify the data. Their success has been demonstrated in the areas of regression, classification and decision-tree construction. SVMs incorporate structural risk minimization, which minimizes an upper bound on the generalization error. Consider a simple case when two sets A and B are linearly separable. The idea is to determine, from an infinite number of planes correctly separating A and B, the one which will have the smallest generalization error. SVMs select the plane which maximizes the margin separating the two classes. The margin is defined as the distance from the separating hyperplane to the nearest point of A, plus the distance from the hyperplane to the nearest point in B.
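
The margin definition above can be checked numerically. The sketch below does not train an SVM; it only evaluates the margin of a few hand-picked candidate hyperplanes w·x + b = 0 in two dimensions and keeps the separating one with the largest margin. The point sets A and B, the candidate list and the function names are assumptions made for the example.

```python
import math

def signed_value(w, b, point):
    """Value of w.x + b; its sign tells on which side of the plane x lies."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b

def separates(w, b, set_a, set_b):
    """True when all of A lies on the negative side and all of B on the positive."""
    return all(signed_value(w, b, p) < 0 for p in set_a) and all(
        signed_value(w, b, p) > 0 for p in set_b
    )

def margin(w, b, set_a, set_b):
    """Distance to the nearest point of A plus distance to the nearest point of B."""
    norm = math.hypot(*w)
    nearest_a = min(abs(signed_value(w, b, p)) for p in set_a) / norm
    nearest_b = min(abs(signed_value(w, b, p)) for p in set_b) / norm
    return nearest_a + nearest_b

A = [(1, 1), (2, 0)]
B = [(4, 4), (5, 3)]
# Candidate planes as (w, b): x = 3, x + y = 5, and y = 2.
candidates = [((1, 0), -3), ((1, 1), -5), ((0, 1), -2)]
valid = [(w, b) for w, b in candidates if separates(w, b, A, B)]
best = max(valid, key=lambda wb: margin(wb[0], wb[1], A, B))
best_margin = margin(best[0], best[1], A, B)
```

All three candidates separate A from B, but they differ in margin; an actual SVM searches over all planes rather than a fixed list, typically by solving a quadratic optimization problem.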


Q.19. Discuss various data mining tools.

Ans. The back-end tools and utilities are used by data warehouse systems to populate and refresh their data. These tools and utilities include the following functions –

(i) Data extraction that collects data from multiple, heterogeneous and external sources.

(ii) Data cleaning that finds errors in the data and resolves them when possible.

(iii) Data transformation that translates data from legacy or host format to warehouse format.

(iv) Load that sorts, consolidates, summarizes, computes views, checks integrity and builds indices and partitions.

(v) Refresh that reflects the updates from the data sources to the warehouse.

Data warehouse systems usually offer a good set of data warehouse management tools apart from cleaning, loading, refreshing and metadata definition tools. To improve the quality of the data and, in turn, the data mining results, data cleaning and data transformation are the important steps.


Predictive Model – Predictive model is defined as the approach used when we know what we are looking for, when we can direct the data mining efforts towards a particular goal. The predictive models use experience to assign scores to some relevant outcome in the future. One of the keys to success is having enough data with the outcome already known to train the model.


having enou 
Q.21. What does data mining actually mean in practice ? Is it really being applied or is it only hype ? Differentiate between predictive vs descriptive data mining. (R.G.P.V., Dec. 2004)

Ans. Data mining is not hype; it is really being applied. The USA is clearly ahead of Europe in this regard, with large organizations like American Express and AT & T utilizing KDD to analyze their client files. In the UK, the BBC has applied data mining techniques to analyze viewing figures. In most European countries, several banks and insurance companies are tentatively doing initial experiments with KDD. However, at the same time, it is becoming increasingly clear that KDD involves more problems than were initially realized. As much as 80% of KDD is about preparing data, and the remaining 20% is about mining.

It is not very difficult to data mine on an adhoc basis. Take a large file, let a cluster algorithm do its job, and you will discover a control set. Several standard tools for these types of activities are already on the market, for example, Intelligent Miner from IBM, Clementine, and 4 Thought from Livingstones, and new tools are announced almost every week.

Difference between Predictive and Descriptive Data Mining – From a data analysis perspective, generalization is a form of descriptive data mining. Descriptive data mining describes data in a concise and summarative way and presents interesting characteristics of the data. On the other hand, predictive data mining analyzes data in order to build one or a set of models, and attempts to predict the behaviour of new data sets.

Q.22. What is data mining functionality ? Explain different types of data mining functionality with examples. (R.G.P.V., June)
Or
Write a short note on data mining functionalities. (R.G.P.V., Nov.)

Ans. Various types of databases and information repositories are available on which data mining can be performed. Data mining functionalities are employed to specify the kind of patterns to be found in data mining tasks. Generally, data mining tasks can be categorized into two categories as follows – descriptive and predictive. Inference on the current data is performed by predictive mining tasks in order to make predictions. The general properties of the data in the database are characterized by descriptive mining tasks. In some conditions, users may have no idea about what kinds of patterns in their data may be interesting. Hence the user may like to search for several distinct kinds of patterns in parallel. A data mining system should therefore be able to mine multiple kinds of patterns to accommodate different user expectations or applications. Data mining systems also should be capable of discovering patterns at multiple granularities. Data mining systems should permit users to specify hints to guide or focus the search for interesting patterns. Because some patterns may not hold for all of the data in the database, a measure of certainty or "trustworthiness" is usually associated with each discovered pattern.

Data mining functionalities are described below –

(i) Concept/Class Description – Characterization and Discrimination – Data can be associated with concepts or classes. For example, classes of items for sale include mobiles and chargers, and concepts of customers include bigSpenders and budgetSpenders in the AllElectronics store. It can be useful to describe individual classes and concepts in summarized, concise and yet precise terms. This type of descriptions of a class or a concept are known as class/concept descriptions. These descriptions can be derived through –

(a) Data discrimination, by comparison of the target class with one or a set of comparative classes (often known as the contrasting classes).

(b) Data characterization, by summarizing the data of the class under study (often known as the target class) in general terms.

(c) Both data characterization and discrimination.

(ii) Mining Frequent Patterns, Associations and Correlations – Patterns that occur frequently in data are known as frequent patterns. There are various kinds of frequent patterns, like itemsets, subsequences and substructures. A set of items that frequently appear together in a transactional data set is known as a frequent itemset. An example of a frequent itemset is milk and bread. A frequently occurring subsequence, like the pattern that customers tend to buy one product first and related products afterwards, is known as a sequential pattern. A substructure can refer to different structural forms, which may be combined with itemsets or subsequences. When a substructure occurs frequently, it is known as a structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

(iii) Evolution Analysis – For objects whose behaviour changes over time, data evolution analysis describes and models regularities or trends. Although this can include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Example – Consider that you have major stock market (time-series) data of the last several years available from the Bombay Stock Exchange and you would like to invest in shares of MNCs. For overall stocks and for the stocks of specific companies, a data mining study of stock exchange data can identify stock evolution regularities. This type of regularities may help predict future trends in stock market prices, contributing to your decision making about stock investments.

(iv) Outlier Analysis – A database can contain data objects that do not comply with the normal behaviour or data model. These data objects are known as outliers. Most data mining methods discard outliers as noise or exceptions. However, the rare events can be more interesting than the more regularly occurring ones in some applications like fraud detection. The analysis of outlier data is known as outlier mining.

Example – Outlier analysis can uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number as compared to regular charges incurred by the same account. With respect to the location and type of purchase or the purchase frequency, outlier values can also be detected.

(v) Cluster Analysis – Clustering analyzes data objects without consulting a known class label. Normally, the class labels are not shown in


the training data simply because they are not known to start with. Clustering can be employed to produce this type of labels. The objects are clustered based on the principle of minimizing the interclass similarity and maximizing the intraclass similarity. This means that clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster which is formed can be viewed as a class of objects, from which rules can be derived. Taxonomy formation can also be facilitated by clustering, meaning the organization of observations into a hierarchy of classes that group similar events together, see fig. 1.11.

Example – It can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. Individual target groups may be represented by these clusters. Consider a 2-dimensional plot of customer data with respect to customer locations in a city. Three clusters of data points are evident. The center of each cluster is marked.
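
One common way to obtain such clusters is the k-means procedure, which is named here as an illustrative choice and is not prescribed by the text above. The sketch alternates between assigning each point to its nearest center and moving each center to the mean of its points. The customer locations and the starting centers are hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign the point to the closest center (squared distance).
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move every center to the mean of its cluster (unchanged if empty).
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer locations in a city (x, y grid coordinates).
locations = [
    (1, 1), (1, 2), (2, 1),
    (8, 8), (8, 9), (9, 8),
    (1, 8), (2, 9), (1, 9),
]
centers, clusters = kmeans(locations, centers=[(1, 1), (8, 8), (1, 8)])
```

No class labels are supplied; the three groups emerge only from the distances between points, and each resulting cluster could then be treated as a class of customers.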
(vi) Classification and Prediction – Classification refers to the process of finding a model which defines and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is not known. There are various methods of creating classification models, like naive Bayesian classification, support vector machines, and k-nearest neighbour classification.

Whereas classification predicts categorical labels, prediction models continuous-valued functions. This means that it is used to predict missing or unavailable numerical data values rather than class labels. However, the term prediction can refer to both class label prediction and numeric prediction. Regression analysis is a statistical methodology which is most often employed for numeric prediction, although other procedures exist as well. Prediction also encompasses the identification of distribution trends based on available data.

Classification and prediction may need to be preceded by relevance analysis, which attempts to recognize attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.

Example – Consider that the sales manager of AllElectronics classifies a large set of items in the store, based on three kinds of responses to a sales campaign – good response, mild response, and no response. A model is derived for each of these three classes based on the descriptive features of the items like price, brand, place-made, type, and category.

Suppose instead that, rather than predicting categorical response labels for each store item, the amount of revenue that each item will produce during an upcoming sale at AllElectronics is predicted, based on previous sales data. This is an example of prediction because the model constructed will predict a continuous-valued function, or ordered value.

Q.23. Explain classification of data mining system based on –
(i) Data model (ii) Types of data
(iii) Functionalities (iv) Level of abstraction of knowledge
(R.G.P.V., June 2012)
Or
Explain the major classifications of data mining system.
(R.G.P.V., May 2018)

Ans. (i) Classification of Data Mining System based on the Models – Data mining systems can be classified according to the different data models, like object-relational, transactional, relational, or data warehouse mining systems.

(ii) Classification of Data Mining System based on the Types of Data – We may classify the data mining systems based on the types of data handled, like spatial data, multimedia data, text, time series, stream data, or a World Wide Web mining system.

(iii) Classification of Data Mining System based on Functionalities – Data mining systems can be classified based on data mining functionalities, like characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.

(iv) Classification of Data Mining System based on Level of Abstraction of Knowledge – Data mining systems can be categorized according to the granularity or levels of abstraction of the knowledge mined, including generalized knowledge, primitive knowledge, or knowledge at multiple levels.


Q.24. List the guidelines for successful data mining.

Ans. There are several guidelines for successful data mining as follows –

(i) The overall project should have the support of the top management of the company.

(ii) The positive return on investment should be realized within six to twelve months.

(iii) A clear problem owner should be identified who is responsible for the project. Preferably this is not a technical analyst or a consultant but someone with direct business responsibility, e.g., someone with experience in a sales or marketing environment. This will benefit the external integration.

(iv) It is suggested that a small pilot project be carried out before beginning a major data mining project. This can involve a steep learning curve for the project team. This is of vital importance.

(v) Data mining projects should be carried out by small teams with a strong internal integration and a loose management style.


Q.25. Is the data warehouse a prerequisite for data mining ? Does data warehouse help data mining ? If so in what ways ? (R.G.P.V.)

Ans. A practicable data warehouse, although not a prerequisite, will give a practical boost to the data mining process.

Data mining fits well and plays a major role in the data warehouse environment. For data mining, the bedrock is formed by the clean and complete data in a data warehouse, and the data warehouse enables data mining operations to take place. Both technologies support each other in this relationship. The major factors of this relationship are as follows –

(i) Data mining algorithms require large amounts of data at the detailed level. The data warehouse holds data at the lowest level of granularity.

(ii) Data mining prospers on integrated and cleansed data. A data warehouse is very suitable for data mining when the ETL functions have been carried out properly.

(iii) With powerful relational database systems and parallel processing, the data warehouse infrastructure is already in place. No other investment may be required to help data mining, because the hardware is already available.

While a data warehousing system formats data and organises it to support management functions, data mining attempts to extract useful information as well as predict trends and patterns from the data. Note that a data warehouse is not exclusive to data mining; data mining can be carried out in traditional databases as well. However, because a data warehouse contains quality data, it is highly desirable to have data mining functions incorporated in the data warehouse system. The relationship between warehousing, mining and database is shown in fig. 3.4.

[Figure: data flow diagram. Recoverable labels are Data Staging Area; Flat Documents with Extracted and Transformed Data; Files Ready for Loading the Data Warehouse; Data Warehouse; Data Extraction; Data Mining; Data Cleaned, Extracted and Transformed for Mining.]

Fig. 3.4 Relationship between Data Warehouse and Data Mining

In general, a data warehouse comes up with query optimisation and access techniques to retrieve an answer to a query – the answer is explicitly in the warehouse. Some data warehouse systems have built-in decision-support capabilities. They do carry out some of the data mining functions, like predictions. For example, consider a query like "How many BMWs were sold in London ?" The answer can clearly be in the data warehouse. However, for a question like "How many BMWs do you think will be sold in London ?" the answer may not explicitly be in the data warehouse. Using certain data mining techniques, the selling patterns of BMWs in London can be discovered and then the question can be answered.

What is a compromise technique ? It depends on the level of granularity you require in the data warehouse. Unless it is a huge burden to keep detailed data at the lowest level of granularity, attempt to store detailed data. Otherwise, for data mining engagements we might extract detailed data directly from the operational systems. This needs data cleansing, consolidation and transformation. Summaries may also be kept in the data warehouse for queries, while summarized data along sets of dimensions may reside in multidimensional systems.

For data mining operations, the data warehouse is an important and easily available data source. The data extractions that mining tools work on come from the data warehouse.

Q.26. Write short note on DBMS vs DM.
Or
Write the differences and similarities between DBMS and DM.
(R.G.P.V., Dec. 2010)
Or
What are the different ways of interfacing a data mining system with database systems ? (R.G.P.V., June 2011)

Ans. DBMS assists query languages that are useful for query triggered data exploration, while data mining assists automatic data exploration. If we know precisely what information we are seeking, a DBMS query would suffice. However, if we do not clearly know the possible correlations or patterns, then data mining techniques are useful. One of the tasks of data mining is hypothesis testing, in which we formulate a hypothesis and test it by sifting through the database. This task can be managed by a DBMS query. Thus, in these senses, DBMS assists some primitive data mining tasks.

From the architectural viewpoint, there are three different manners in which data mining systems utilize relational DBMS. They may not use it at all, be loosely coupled or tightly coupled.

A majority of data mining systems do not use any DBMS and have their own memory and storage management. They treat the database simply as a data repository from which data is expected to be downloaded into their own memory structures, before the data mining algorithm begins. These systems ignore the field-proven technologies of DBMS, like recovery, concurrency, etc.


The second approach is to have a loosely-coupled DBMS, in which the
DBMS is used only for storage and retrieval of data. For instance, one can use
loosely-coupled SQL to fetch data records as needed by the mining algorithm.
The front-end of the application is implemented in a host programming language
with embedded SQL statements in it. The application uses a SQL
select statement to retrieve the set of records of interest from the database. A
loop in the application program copies records in the result set one by one
from the database address space to the application address space, where
computation is carried out on them. This loosely-coupled approach does not
use the querying capabilities offered by the DBMS.


In the tightly-coupled approach, portions of the application program are
selectively pushed to the database system to perform the essential computation.
Data are stored in the database and all processing is performed at the database
end. This differs from bringing the data from the database to the data mining
area; instead, the data mining application goes where the data naturally reside.
This avoids performance degradation and takes full advantage of database technology.
The performance of this approach depends on how well the data mining
process is optimized while mapping it to a query.
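
The contrast between the loosely-coupled and tightly-coupled approaches can be illustrated with a minimal sketch. It assumes Python's built-in `sqlite3` module as a stand-in DBMS; the `sales` table, its columns and the function names are hypothetical and chosen only for illustration. The first function copies raw records into application memory and computes there; the second pushes the same computation into the database as a query.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)],
    )
    return conn


def loosely_coupled_totals(conn):
    """Loosely coupled: fetch raw records, then compute in the application."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def tightly_coupled_totals(conn):
    """Tightly coupled: push the computation to the DBMS, fetch only the result."""
    rows = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    return dict(rows.fetchall())
```

Both functions return the same totals; the difference is where the work is done and how many records cross the boundary between the database and the application.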


Q.27. Write short note on “KDD”. (R.G.P.V., Dec.)

Ans. The term ‘KDD’ (Knowledge Discovery in Databases) is employed
to describe the complete process of extracting knowledge from data. Here,
knowledge means relationships and patterns between data elements. It has
further been proposed that the term ‘data mining’ should be used exclusively for
the discovery stage of the KDD process. A more or less official definition of KDD is:
the nontrivial extraction of implicit, previously unknown and potentially useful
knowledge from data. So the knowledge must be new, not obvious, and one
must be able to use it. KDD is not a new technique but rather a multidisciplinary
field of research; machine learning, statistics, database technology, expert
systems, and data visualization all make a contribution.

American Express and AT&T are utilizing KDD to analyze their client files.
In most European countries, several banks and insurance companies are
tentatively doing initial experiments with KDD. At the same time, however, it
is becoming increasingly clear that KDD involves more problems than were
initially realized. As much as 80% of KDD is about preparing data, and the
remaining 20% is about mining. Manipulating data using normal database
routines in order to clean or code it is a much more important part of the KDD
process than the actual pattern recognition itself.


Q.28. How is data mining different from KDD ?
Or
Write differences between KDD vs DM. (R.G.P.V., Dec. 2010)

Ans. Data mining is one of the steps involved in knowledge discovery in
databases. The different steps in the knowledge discovery process are data
cleaning and preprocessing, data transformation and reduction, [...]
acceptable computational efficiency limitations. Thus, the structures that are
the result [...]
Q.29. Explain KDD briefly. Distinguish between KDD and data mining.
(R.G.P.V., June 2011)

Ans. KDD - Refer to Q.27.

Differences between KDD and DM - Refer to Q.28.

Q.30. What do you understand by data mining ? Discuss the role of
data mining as a step in knowledge discovery process. (R.G.P.V., June 2009)

Ans. Data Mining - Refer to Q.1.

Steps in Knowledge Discovery - Some people view data mining as a
synonym for another popularly used term, Knowledge Discovery from Data,
or KDD, while others view data mining as an important step in the knowledge
discovery process. Fig. 3.5 shows knowledge discovery as a process; it
consists of an iterative sequence of the following steps -

(i) Data Cleaning - In this step, noisy and inconsistent data are
removed.

(ii) Data Integration - In this step, multiple data sources may be
combined.

(iii) Data Selection - In this step, data relevant to the analysis task
are retrieved from the database.

(iv) Data Transformation - In this step, data are transformed or
consolidated into forms suitable for mining.

(v) Data Mining - It is an essential process in which intelligent
methods are used in order to extract data patterns.

(vi) Pattern Evaluation - This step identifies the truly interesting
patterns representing knowledge on the basis of some interestingness measures.

(vii) Knowledge Presentation - In this step, visualization and
knowledge representation techniques are applied to present the mined
knowledge to the user.
[Figure: Databases, Cleaning and Integration, Selection and Transformation, Patterns, Evaluation and Presentation]

Fig. 3.5 Knowledge Discovery Process


The data are prepared for mining in steps (i) to (iv), which are different
forms of data preprocessing. The data mining step may interact with the user
or a knowledge base. The interesting patterns are presented to the user and may
be stored as new knowledge in the knowledge base. As per this view, data mining
is only one step in the whole process, although an important one since it reveals
hidden patterns for evaluation. We agree that data mining is a step in the
knowledge discovery process. However, the term data mining is becoming
more popular than the longer term knowledge discovery from
data in the media, in industry, and in the database research community.
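
The iterative sequence of KDD steps can be sketched as a small pipeline. This is only an illustrative sketch: the record layout (`region`, `items`), the function names and the choice of co-occurring item pairs as the "pattern" are assumptions made for the example, not part of the KDD definition.

```python
from collections import Counter
from itertools import combinations


def clean(records):
    """Data cleaning: drop records whose item list is missing or empty."""
    return [r for r in records if r.get("items")]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def select(records, region):
    """Data selection: keep only the records relevant to the analysis task."""
    return [r for r in records if r["region"] == region]


def transform(records):
    """Data transformation: normalise each record into a set of item names."""
    return [frozenset(i.strip().lower() for i in r["items"]) for r in records]


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {p: c / n_baskets for p, c in counts.items() if c / n_baskets >= min_support}


def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} & {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]


def run_kdd(sources, region, min_support):
    merged = integrate(*(clean(s) for s in sources))
    baskets = transform(select(merged, region))
    return present(evaluate(mine(baskets), len(baskets), min_support))
```

For example, calling `run_kdd` on two small store files for one region returns only the item pairs that occur together often enough to be considered interesting under the chosen support threshold.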


Q.31. What are the differences between -
(i) Data mining versus query tools
(ii) Data mining and machine learning
(iii) Relational DBS and spatial DBS
(iv) Data mining versus knowledge discovery.
(R.G.P.V., June 2013)


Ans. (i) Data Mining Versus Query Tools - Data mining tools and query
tools are complementary. A data mining tool does not replace a query tool, but it
provides the user a lot of additional possibilities. For example, consider a large
file that contains millions of records that describe your customers' purchases over
the last ten years. In such a file, there is a large amount of useful knowledge
which can be found by using normal queries at the database. For instance, what
is the average turnover in a certain sales region in July ?, who bought which
product on what date ?, etc. However, there is knowledge hidden in your data
that is difficult to determine using SQL. For example, what are the most important
trends in customer behavior ? or what is an optimal segmentation of my clients ?
These questions could be answered using SQL. You could try to guess for
yourself some defining criteria for customer profiles and query the database to
see whether they work or not. In a process of trial and error, one could gradually
develop intuitions about what the important distinguishing attributes are.
In this way, it would take days or months to get an optimal answer in a large
database. Once the data mining tool has found a segmentation, you can use your
query environment again to query and analyze the profiles found. One could say
that if you know exactly what you are looking for, use SQL; otherwise use data
mining.


(ii) Data Mining and Machine Learning - KDD or data mining is
concerned with finding understandable knowledge, whereas machine learning
[...] databases.
(iii) Relational DBS and Spatial DBS - The main difference between
data mining in relational DBS and in spatial DBS is that attributes of the
neighbors of some object of interest may have an influence on the object and
therefore have to be considered as well. The explicit location and extension of
spatial objects define implicit relations of spatial neighborhood which are used
by spatial data mining algorithms. Therefore, new techniques are required for
effective and efficient data mining.

(iv) Data Mining versus Knowledge Discovery - Refer to Q.28.


Q.32. What are the various stages of knowledge data discovery ? Explain
each stage with suitable example. (R.G.P.V., Dec. 2002)
Or
What are the steps involved in data mining ? (R.G.P.V., Dec. 2003)
Or
What are the steps involved in data mining ? Explain them.
(R.G.P.V., June 2008)
Or
Describe the steps involved in data mining when viewed as a process of
knowledge discovery. (R.G.P.V., Dec. 2009)

Ans. The stages of KDD are as follows -


(i) Selection - This stage is concerned with selecting or segmenting
the data that are relevant to some criteria. For example, for credit card customer
profiling, we extract the type of transactions for each type of customer, and
we may not be interested in details of the shop where the transaction takes place.

(ii) Preprocessing - Preprocessing is the data cleaning stage where
superfluous information is removed. For example, it is unnecessary to note the
sex of a patient when studying pregnancy. When the data is drawn from
different sources, it is possible that the same information is found in different
sources in different formats. This stage reconfigures the data to ensure a
consistent format, as there is a possibility of inconsistent formats.

(iii) Transformation - The data is not merely transferred across, but
transformed in order to be appropriate for the task of data mining. In this
stage, the data is made usable and navigable.

(iv) Data Mining - This stage is concerned with the extraction of
patterns from the data.

(v) Interpretation and Evaluation - The patterns found in the data
mining stage are converted into knowledge, which in turn, is used to help
decision-making.

(vi) Data Visualization - Data visualization makes it possible for the
analyst to gain a deeper, more intuitive understanding of the data and as such
can work well alongside data mining. Data mining permits the analyst to focus
on certain patterns and trends and explore them in depth employing visualization.
On its own, data visualization can be overwhelmed by the volume of data in a
database, but in conjunction with data mining it can help with exploration.

Data visualization assists users to examine huge amounts of data and detect
the patterns visually. Visual displays of data like maps, charts and other graphical
representations permit data to be presented compactly to the users. A single
graphical screen can encode as much information as a far larger number
of text screens. For example, if a user wants to find out whether the production
problems at a plant are correlated to the location of the plants, the problem
locations can be encoded in a special color, say red, on a map. Then, the user
can find the locations in which the problems are occurring. He may then form
a hypothesis about why problems are occurring in those locations, and may
verify the hypothesis against the database.

Q.33. Can you briefly describe the four stages of knowledge discovery
(KDD) ? Can you describe the multi-tiered data warehouse architecture ?
(R.G.P.V., May 2019)

Ans. Refer to Q.32 and Q.6 (Unit-I).

Q.34. Explain the major issues in data mining.
(R.G.P.V., Dec. 2008, June 2010)
Or
[...] mining methodology, [...] (R.G.P.V., Nov./Dec. 2007)
Or
What are the major challenges in data mining regarding data mining
methodology and performance issues ? (R.G.P.V., June 2009)
Or
[...] challenges to data mining regarding data mining methodology
and user interaction issues. (R.G.P.V., June 2010, 2017)
Or
Discuss the issues and challenges in data mining. (R.G.P.V., Dec. 2011)
Or
Discuss various issues in data mining. (R.G.P.V., June 2015)
Or
Briefly explain major issues and challenges of data mining.
(R.G.P.V., Nov. 2019)

Ans. Major issues in data mining are as follows -

(i) Mining Methodology and User Interaction Issues - These
issues reflect the ability to mine knowledge at multiple granularities, the types
of knowledge mined, ad hoc mining, knowledge visualization, and the use of
domain knowledge.

(a) Interactive Mining of Knowledge at Multiple Levels of
Abstraction - The data mining process should be interactive because it is
difficult to know exactly what can be discovered within a database. Interactive
mining permits users to focus the search for patterns, providing and refining
data mining requests on the basis of returned results. In this way, the user can
interact with the data mining system to view data and discovered patterns at
multiple granularities and from different angles.


(b) Mining Different Types of Knowledge in Databases -
Data mining should cover a wide spectrum of data analysis and knowledge
discovery tasks since different users are interested in different types of
knowledge. These tasks may use the same database in various ways and need
the development of numerous data mining techniques.


(c) Data Mining Query Languages and Ad hoc Data Mining -
High-level data mining query languages need to be developed to permit users to
describe ad hoc data mining tasks by making easier the specification of pertinent
data sets for analysis, the domain knowledge, the types of knowledge to be mined,
and the conditions and constraints to be applied on the discovered patterns.
This type of language should be combined with a database or
data warehouse query language and optimized for efficient and flexible data
mining.

(d) Presentation and Visualization of Data Mining Results -
To understand and use the knowledge easily, discovered knowledge should be
expressed in high-level languages, visual representations, or other expressive
forms so that the knowledge can be easily understood and directly used by
humans. This is critical if the data mining system is to be interactive. This
needs the system to use expressive knowledge representation techniques, like
rules, tables, graphs, charts, trees, crosstabs, matrices, or curves.

(e) Incorporation of Background Knowledge - Background
knowledge, or information regarding the domain under study, may be used to
guide the discovery process and permit discovered patterns to be represented in
brief terms and at different levels of abstraction.

(f) Handling Noisy or Incomplete Data - The data stored in a
database may reflect noise, incomplete data objects or exceptional cases. These
objects may confuse the process when mining data regularities and lead
the knowledge model constructed to overfit the data. Consequently, the accuracy
of the discovered patterns can be worse. Data analysis and data cleaning
methods are required to handle noise, as well as outlier mining for exceptional
cases.

(g) Pattern Evaluation - A data mining system can reveal a large
number of patterns. Most of them may be uninteresting to the given
user, either because they represent common knowledge or lack novelty. The
use of interestingness measures or user-specified constraints to guide the
discovery process and limit the search space is another active area of research.


(ii) Issues Relating to the Diversity of Database Types — 


(a) Handling of Relational and Complex Types of Data -
The development of efficient and effective data mining systems for relational
databases and data warehouses is important because such data are widely used.
However, other databases may contain complex data objects, hypertext and
multimedia data, spatial data, temporal data or transaction data. Given the
variety of data types and different goals of data mining, it is unrealistic to
expect one system to mine all types of data. Specific data mining systems
should be constructed for mining specific types of data. Hence, one may expect
to have different data mining systems for different types of data.

(b) Mining Information from Heterogeneous Databases and
Global Information Systems - The knowledge discovery from different sources
of structured, semistructured, or unstructured data with diverse data semantics
poses great challenges to data mining. Data mining may help to reveal high-level
data regularities in multiple heterogeneous databases that are unlikely to
be discovered by simple query systems and may enhance information exchange
and interoperability in heterogeneous databases. In data mining, Web mining
is a very challenging and fast evolving field. Web mining uncovers
interesting knowledge about Web contents, Web structures, Web usage and
Web dynamics.

(iii) Performance Issues - These issues include efficiency, scalability
and parallelization of data mining algorithms.

(a) Parallel, Distributed and Incremental Mining Algorithms -
Parallel and distributed algorithms split the data into portions for processing in
parallel, and the results obtained from the portions are merged. Besides, the
need for incremental data mining algorithms is promoted by the high cost of some
data mining processes. Such algorithms incorporate database updates
without having to mine the whole data again “from scratch”.

(b) Efficiency and Scalability of Data Mining Algorithms -
Data mining algorithms must be efficient and scalable to effectively extract
information from a large amount of data in databases. The key issues in the
implementation of data mining systems are efficiency and scalability from a
database point of view on knowledge discovery.
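
The split-process-merge idea and the incremental idea can be sketched as follows. This is a hedged illustration only: item-frequency counting stands in for a real mining algorithm, the partitions are processed one after another rather than on separate workers, and all function names are hypothetical.

```python
from collections import Counter


def count_items(transactions):
    """Count in how many transactions each item appears."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))
    return counts


def mine_in_partitions(transactions, n_parts):
    """Split the data into portions, mine each portion, then merge the results.

    Each portion could be handed to a separate worker; here they are
    processed sequentially to keep the sketch short.
    """
    size = max(1, -(-len(transactions) // n_parts))  # ceiling division
    merged = Counter()
    for start in range(0, len(transactions), size):
        merged.update(count_items(transactions[start:start + size]))
    return merged


def incremental_update(counts, new_transactions):
    """Fold a database update into existing counts without re-mining old data."""
    updated = Counter(counts)
    updated.update(count_items(new_transactions))
    return updated
```

Both routes give the same counts as mining the whole data in one pass; the incremental version only touches the newly arrived transactions.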


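INTRODUCTION TO FUZZY SETS AND FUZZY LOGIC

Q.35. Write a short note on fuzzy sets.
Ans. Fuzzy sets support a flexible sense of membership of elements to a
set. In fuzzy set theory many degrees of membership between 0 and 1 are
allowed. Thus, a membership function $\mu_{\tilde{P}}$ is associated with a fuzzy set $\tilde{P}$
such that the function maps every element of the universe of discourse X to
the interval [0, 1].

The mapping is represented as $\mu_{\tilde{P}}(x) : X \to [0, 1]$.

A fuzzy set $\tilde{P}$ defined on X, where X is a universe of discourse and x is a
particular element of X, may be written as a collection of ordered pairs

$\tilde{P} = \{(x, \mu_{\tilde{P}}(x)),\ x \in X\}$ ...(i)

where each pair $(x, \mu_{\tilde{P}}(x))$ is called a singleton. An alternative definition
indicates a fuzzy set as a union of all $\mu_{\tilde{P}}(x)/x$ singletons, written as

$\tilde{P} = \sum_{x_i \in X} \mu_{\tilde{P}}(x_i)/x_i$ ...(ii)

and for the continuous case

$\tilde{P} = \int_X \mu_{\tilde{P}}(x)/x$ ...(iii)

Q.36. What are the different properties of fuzzy sets ?
Ans. Fuzzy sets follow some of the properties satisfied by crisp sets. Any
fuzzy set $\tilde{P}$ is a subset of the reference set X. Also the membership of any
element belonging to the null set $\emptyset$ is 0 and the membership of any element
belonging to the reference set is 1.

The properties satisfied by fuzzy sets are -
(i) Associativity - $\tilde{P} \cap (\tilde{Q} \cap \tilde{R}) = (\tilde{P} \cap \tilde{Q}) \cap \tilde{R}$
$\tilde{P} \cup (\tilde{Q} \cup \tilde{R}) = (\tilde{P} \cup \tilde{Q}) \cup \tilde{R}$
(ii) Commutativity - $\tilde{P} \cap \tilde{Q} = \tilde{Q} \cap \tilde{P}$
$\tilde{P} \cup \tilde{Q} = \tilde{Q} \cup \tilde{P}$
(iii) Distributivity - $\tilde{P} \cap (\tilde{Q} \cup \tilde{R}) = (\tilde{P} \cap \tilde{Q}) \cup (\tilde{P} \cap \tilde{R})$
$\tilde{P} \cup (\tilde{Q} \cap \tilde{R}) = (\tilde{P} \cup \tilde{Q}) \cap (\tilde{P} \cup \tilde{R})$
(iv) Identity - $\tilde{P} \cup X = X$
$\tilde{P} \cup \emptyset = \tilde{P}$
$\tilde{P} \cap X = \tilde{P}$
$\tilde{P} \cap \emptyset = \emptyset$
(v) Idempotence - $\tilde{P} \cup \tilde{P} = \tilde{P}$
$\tilde{P} \cap \tilde{P} = \tilde{P}$
(vi) Involution - $(\tilde{P}^c)^c = \tilde{P}$
(vii) Transitivity - If $\tilde{P} \subseteq \tilde{Q} \subseteq \tilde{R}$, then $\tilde{P} \subseteq \tilde{R}$
(viii) DeMorgan’s Laws -
$(\tilde{P} \cup \tilde{Q})^c = \tilde{P}^c \cap \tilde{Q}^c$
$(\tilde{P} \cap \tilde{Q})^c = \tilde{P}^c \cup \tilde{Q}^c$
Since fuzzy sets can overlap, the laws of excluded middle do not hold good. Thus,
$\tilde{P} \cup \tilde{P}^c \neq X$
$\tilde{P} \cap \tilde{P}^c \neq \emptyset$

Q.37. What is fuzzy logic ? Explain its importance.
Ans. In crisp logic, the truth values of propositions are 2-valued,
which may be treated as equivalent to (0, 1). But, in fuzzy
logic, the truth values of a proposition are multivalued, such as absolutely true, partly
true, very true and so on, and are equivalent to (0-1).

A fuzzy proposition is a statement which acquires a fuzzy truth value.
Thus, given $\tilde{P}$ to be a fuzzy proposition, $T(\tilde{P})$ represents the truth value (0-1)
attached to $\tilde{P}$. In its simplest form, fuzzy propositions are associated with fuzzy
sets. The fuzzy membership value associated with the fuzzy set $\tilde{A}$ for $\tilde{P}$ is
treated as the fuzzy truth value $T(\tilde{P})$. That is,

$T(\tilde{P}) = \mu_{\tilde{A}}(x)$ where $0 \le \mu_{\tilde{A}}(x) \le 1$

The following are the major fuzzy logic connectives -
(i) Negation (ii) Disjunction ($\vee$)
(iii) Conjunction ($\wedge$) (iv) Implication ($\Rightarrow$)

The semantic or meaning of the fuzzy connectives is shown in table 3.2. Here
$\tilde{P}$, $\tilde{Q}$ are fuzzy propositions and $T(\tilde{P})$, $T(\tilde{Q})$ are their truth values.

Table 3.2
Connective      Definition
Negation        $1 - T(\tilde{P})$
Disjunction     $\max(T(\tilde{P}), T(\tilde{Q}))$
Conjunction     $\min(T(\tilde{P}), T(\tilde{Q}))$
Implication     $\neg\tilde{P} \vee \tilde{Q} = \max(1 - T(\tilde{P}), T(\tilde{Q}))$

$\tilde{P}$ and $\tilde{Q}$ related by the ‘$\Rightarrow$’ operator are called antecedent and consequent
respectively. Here too, ‘$\Rightarrow$’ represents the IF-THEN statement as
IF x is $\tilde{P}$ THEN y is $\tilde{Q}$, and is equivalent to

$R = (\tilde{P} \times \tilde{Q}) \cup (\tilde{P}^c \times Y)$

where Y is the universe of discourse of y. The membership function of R is defined as

$\mu_R(x, y) = \max(\min(\mu_{\tilde{P}}(x), \mu_{\tilde{Q}}(y)),\ 1 - \mu_{\tilde{P}}(x))$

Also, for the compound implication IF x is $\tilde{P}$ THEN y is $\tilde{Q}$ ELSE y is $\tilde{R}$,
the relation R (written without a tilde to distinguish it from the fuzzy set $\tilde{R}$) is defined as

$R = (\tilde{P} \times \tilde{Q}) \cup (\tilde{P}^c \times \tilde{R})$

The membership function of R is defined as

$\mu_R(x, y) = \max(\min(\mu_{\tilde{P}}(x), \mu_{\tilde{Q}}(y)),\ \min(1 - \mu_{\tilde{P}}(x), \mu_{\tilde{R}}(y)))$

Q.38. Briefly explain two applications of fuzzy systems.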


Ans. Fuzzy logic finds applications in many systems. One application
of fuzzy logic is the Sendai Subway system in Sendai, Japan. The
control of the Nanboku line, developed by Hitachi, used a fuzzy controller to
run the train all day long. This made the line one of the smoothest running
subway systems in the world and increased efficiency as well as stopping time.
The most tangible applications of fuzzy logic control have appeared in commercial
appliances, specifically, but not limited to, heating, ventilation and air
conditioning systems. These systems use fuzzy logic thermostats to control
the heating and cooling; this saves energy by making the system more efficient.
It also keeps the temperature more steady than a traditional thermostat.
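
The fuzzy set properties and the IF-THEN relation discussed in this section can be checked numerically with a small sketch. It assumes the standard definitions of union, intersection and complement as max, min and 1 minus the membership value, and represents a fuzzy set as a dictionary from elements to membership values. The function names and the sample sets are hypothetical.

```python
def f_union(p, q):
    """Fuzzy union: membership is the maximum of the two memberships."""
    return {x: max(p[x], q[x]) for x in p}


def f_intersection(p, q):
    """Fuzzy intersection: membership is the minimum of the two memberships."""
    return {x: min(p[x], q[x]) for x in p}


def f_complement(p):
    """Fuzzy complement: membership is 1 minus the original membership."""
    return {x: 1.0 - m for x, m in p.items()}


def implication(p, q):
    """Relation for IF x is P THEN y is Q:
    mu_R(x, y) = max(min(mu_P(x), mu_Q(y)), 1 - mu_P(x))."""
    return {(x, y): max(min(p[x], q[y]), 1.0 - p[x]) for x in p for y in q}


# Two fuzzy sets on the same small universe (illustrative values only).
P = {"a": 0.25, "b": 0.5, "c": 1.0}
Q = {"a": 0.75, "b": 0.5, "c": 0.0}

# DeMorgan's law holds for fuzzy sets.
demorgan_holds = f_complement(f_union(P, Q)) == f_intersection(
    f_complement(P), f_complement(Q)
)

# The law of excluded middle fails: P union P^c is not the whole universe.
excluded_middle = f_union(P, f_complement(P))
```

Here `excluded_middle` contains memberships below 1 for the elements "a" and "b", which is exactly why the union of a fuzzy set with its complement need not equal X.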
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LEARNING, CLASSIFICATION — STATISTICAL 
BASED ALGORITHMS, DISTANCE BASED ALGORITHMS,
DECISION TREE BASED ALGORITHMS)

Q.1. Write a short note on supervised and unsupervised learning with
[...]. (R.G.P.V., Nov. 2019)

Ans. Supervised Learning - [...] This is similar to discriminant analysis,
which occurs in statistics.


[Figure: a new text, document, image or sound is converted to a feature vector, from which the expected label is predicted]

Fig. 4.1 Supervised Learning
Supervised learning deals with learning a function from available training
data. A supervised learning algorithm analyzes the training data and produces
an inferred function, which can be used for mapping new examples. Common
examples of supervised learning include -
(i) Classifying e-mails as spam
(ii) Labeling webpages based on their content
(iii) Voice recognition.
There are many supervised learning algorithms, such as SVM
(support vector machines), Naive Bayes classifiers, neural networks and
decision trees.

Unsupervised Learning - Unsupervised learning is learning from
observation and discovery. In this mode of learning, the system analyzes the given
set of data to observe similarities emerging out of the subsets of the data. The
outcome is a set of class descriptions, one for each class, discovered in the
environment. This is similar to cluster analysis in statistics.

Unsupervised learning makes sense of unlabeled data without having any
predefined data set for its training. Unsupervised learning is an extremely
powerful tool for analyzing available data and looking for patterns and trends. It
is most commonly used for clustering similar input into logical groups. Common
approaches to unsupervised learning include -
(i) k-means
(ii) Self-organizing maps
(iii) Hierarchical clustering.
Some popular examples of unsupervised learning algorithms are -
(i) Genetic algorithms
(ii) Clustering approaches
(iii) Apriori algorithm for association rule learning problems.

[Figure: input data (e.g. sound) mapped to a better representation]

Fig. 4.2 Unsupervised Learning
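
Clustering similar input into groups without labels can be illustrated with a minimal k-means sketch. The simplifications are assumptions made for brevity: the first k points are used as initial centroids (real implementations usually initialise randomly), and the number of iterations is fixed.

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means: alternate between assigning points to the nearest
    centroid and recomputing each centroid as the mean of its members."""
    centroids = [list(p) for p in points[:k]]  # simplistic initialisation
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, assignment = kmeans(points, k=2)
```

No class labels are supplied; the two groups emerge only from the distances between the points.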


Q.2. Describe the problem and issues in supervised learning.

Ans. There are six issues taken into account while dealing with supervised
learning as follows -

(i) Heterogeneity of Data - Many algorithms like neural networks
and support vector machines like their feature vectors to be homogeneous
and normalized. The algorithms that employ distance metrics are very
sensitive to this, and hence if the data is heterogeneous, these methods should
not be the first choice. Decision trees can handle heterogeneous data very easily.

(ii) Redundancy of Data - If the data contains redundant information,
i.e. highly correlated values, then it is useless to use distance based
methods because of numerical instability. In this case, some sort of
regularization can be employed on the data to prevent this situation.

(iii) Dependent Features - If there is some dependence between the
feature vectors, then algorithms that monitor complex interactions like neural
networks and decision trees fare better than other algorithms.

(iv) Bias-Variance Tradeoff - The training set may contain several
different but equally good data sets. Now the learning algorithm is said to be
biased for a particular input if, when trained on these data sets, it is
systematically incorrect while predicting the correct output for that particular
input. A learning algorithm has a high variance for a particular input when it
provides different outputs when trained on different data sets. Thus, there is a
tradeoff between bias and variance, and a supervised learning approach is able
to adjust this tradeoff.

(v) Amount of Training Data and Function Complexity - The
amount of data required during the training period depends on the
complexity of the function required to map from the training data set. So, for
a function with low complexity, the learning algorithm can learn from
a small amount of data. Whereas, on the other hand, for high complexity
functions, the learning algorithm needs a large amount of data.

(vi) Dimensionality of the Input Space - If the input feature vectors
have high dimension then the learning problem can be difficult even if it depends
on a small number of features. This is because the many “extra” dimensions
can confuse the learning algorithm and cause it to have high variance. Hence,
high input dimensionality typically requires tuning the classifier to have low
variance and high bias.


Q.3. Write short note on classification. (R.G.P.V., June 2013)

Ans. A bank loans officer needs analysis of her data in order to learn
which loan applicants are “safe” and which are “risky” for the bank. A marketing
manager at AllElectronics needs data analysis to help guess whether a customer
with a given profile will buy a new computer. A medical researcher wants to
analyze breast cancer data in order to predict which one of three specific
treatments a patient should receive. In each of these examples, the data analysis
task is classification, where a model or classifier is constructed to predict
categorical labels, such as “safe” or “risky” for the loan application data.


Q.4. How does classification work ?

Ans. Data classification is a two-step process, as shown for the loan
application data of fig. 4.3. In the first step, a classifier is built describing a
predetermined set of data classes or concepts. This is the learning step, where
a classification algorithm builds the classifier by analyzing or “learning from”
a training set made up of database tuples and their associated class labels. A
tuple, X, is represented by an n-dimensional attribute vector, $X = (x_1, x_2, ..., x_n)$,
depicting n measurements made on the tuple from n database attributes,
respectively, $A_1, A_2, ..., A_n$. Each tuple, X, is assumed to belong to a predefined
class as determined by another database attribute called the class label attribute.
The class label attribute is discrete-valued and unordered. It is categorical in
that each value serves as a category or class. The individual tuples making up
the training set are referred to as training tuples and are selected from the
database under analysis.

[Figure: training data are fed to a classification algorithm, which produces classification rules]

Training data (partial) -
name            age            income    loan_decision
Susan Lake      senior         low       safe
Claire Phips    senior         medium    safe
Joe Smith       middle_aged    high      safe

Classification rules -
IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low
THEN loan_decision = risky

Fig. 4.3 Learning : Training data are Analyzed by a Classification Algorithm

The first step of the classification process can also be viewed as the learning
of a mapping or function, y = f(X), that can predict the associated class label y
of a given tuple X. In this view, we wish to learn a mapping or function that
separates the data classes. Typically, this mapping is represented in the form
of classification rules, decision trees, or mathematical formulae. In our example,
the mapping is represented as classification rules that identify loan applications
as being either safe or risky as shown in fig. 4.3. The rules can be used to
categorize future data tuples, as well as provide deeper insight into the database
contents. They also provide a compressed representation of the data.

Q.5. What do you mean by statistical based algorithms ?

Ans. Statistical based algorithms can be categorised in two types -

(i) Regression - Regression is used to map a data item to a real
valued prediction variable. In actuality, regression involves the learning of the
function that does this mapping. Regression assumes that the target data fit
into some known type of function (e.g. linear, logistic, etc.) and then determines
the best function of this type that models the given data. Some type of error
analysis is used to determine which function is “best”. Standard linear regression
is a simple example of regression. For example, a college professor wishes to
reach a certain level of savings before his retirement. Periodically, he predicts
what his retirement savings will be based on its current value and several past
values. He uses a simple linear regression formula to predict this value by fitting
past behavior to a linear function and then using this function to predict the
values at points in the future. Based on these values, he then alters his investment
portfolio.

(ii) Bayesian Classification - Bayesian classifiers are statistical
classifiers. They can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem. Studies comparing
classification algorithms have found a simple Bayesian classifier known as
the naive Bayesian classifier to be comparable in performance with decision tree
and selected neural network classifiers. Bayesian classifiers have also
exhibited high accuracy and speed when applied to large databases.

Naive Bayesian classifiers assume that the effect of an attribute value on
a given class is independent of the values of the other attributes. This assumption
is called class conditional independence. It is made to simplify the
computations involved and, in this sense, is considered ‘naive’. Bayesian belief
networks are graphical models, which unlike naive Bayesian classifiers, allow
the representation of dependencies among subsets of attributes. Bayesian belief
networks can also be used for classification.

Q.6. Discuss straight-line and multiple linear regression analysis.

Ans. Straight-line regression analysis involves a response variable, y, and
a single predictor variable, x. It is the simplest form of regression and models
y as a linear function of x. That is,

y = b + wx

where the variance of y is assumed to be constant, and b and w are regression
coefficients specifying the Y-intercept and slope of the line, respectively. The
regression coefficients, w and b, can also be thought of as weights, so that we
can equivalently write

$y = w_0 + w_1 x$.

These coefficients can be solved for by the method of least squares, which
estimates the best-fitting straight line as the one that minimizes the error between
the actual data and the estimate of the line.
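
For a single predictor, the method of least squares has a closed form: the slope is the sum of products of deviations from the means divided by the sum of squared deviations of x, and the intercept follows from the two means. A minimal sketch, with hypothetical function and variable names and made-up data, is given below.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of (x - x_bar)(y - y_bar) over sum of (x - x_bar)^2.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    w0 = y_bar - w1 * x_bar
    return w0, w1


def predict_line(w0, w1, x):
    """Predict the response for a new value of the predictor."""
    return w0 + w1 * x


w0, w1 = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

The sample data lie exactly on a straight line, so the fitted coefficients recover that line and `predict_line` extends it to new values of x.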


Pity EES 
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: each tuple as a point in an n-dimensional space, 


values of predictor variable, x, for some population and thei consider ; 
à ae $ cir an a estimate the a : Be: 
for response variable, y. The training set contains |D] data pointed vi es can be used to ae ate isi pene of each point in a 
N l ze scretized a >Q aCe. 7- 
ENEAS Yak Qp Yip): The regression coefficients sae of the N nAi space for a set of c hea a yi pee ae on a smaller 
using this method with the following equations — n be estim A : sional combinations. This allows a nigher-dimensional data 
Di ateq ¡men tructed from lower-dimensional spaces. Log-linear models 
zi s ns a R rea Bar a ; ; 
Zai -X yi -y) o useful for dimensionality reduction and data smoothing. 
IRS dloglinear models can both be used on sparse data, although 
1 ID) Se, ssion an By be limited. While both methods can handle skewed data, 
2; (xj — X) ‘ceptionally well. Regression can be computationally intensive 
oa ae high-dimensional data, whereas log-linear models show good 
Wo = y- wX pto 10 or so dimensions. 
here X i 7 E O yi for U , ; Hid 
where X is the acan value of X;, X2, ..-, Xipj and y is the mean Value of ted the simple linear regression and multiple linear regression 
.-. Yip The coefficients wo and w; often provide good approxima, °? a 
d . - . 10 ‘ : R 
ioe ie opica E $ Shoe mt i ear regression, the model specification is that the dependent variable, 
Multiple linear regression is an extension of straight-line regression par pination of the parameters (but need not be linear in the independent 
to involve more than one predictor variable. It allows response Variable oe a fs ee ole, in simple linear regression for modeling n data points 
modeled as a linear function of, say, n predictor variables or attributes A Obe : t ependent variable — xj, and two parameters, By and B} — 
.-- Ap describing a tuple, X. Our training data set, D, contains data of then si Rees Se th 
CAN 0. CR apy ass A , where the X; are the n-di : om Sly) Pea ; : 
(% yı) ( 2> y2) (Xp; Yp) i dimensional trainin e linear regression, there are several independent variables or 
tuples with associated class labels, y;. An example of a multiple linear regressi dependent v Des. 
n 


model based on two predictor attributes or variables, A, and A, is 
y = Wo + W1X] F WX), Adding a term 1n Xj 
y,= Bo + BiXi * Box? + Ej {ial arn 
1 5 . . 
| linear regression; although the expression on the right hand 
c in the independent variable x,, it is linear in the parameters 


where x, and x are the values of attributes A and A3, respectively, in x. 
Q.7. Explain regression and log-linear models. This is stil 
side is quadrati 


and Bz. 
h i both cases, £; is an 


observation. ; 3 : 
Returning our attention to the straight line case — Given a random sample 


fom the population, we estimate the population parameters and obtain the 
sample linear regression model — 


hve Bo + Bixi 

“The residual, e; = y;—¥;, is the difference between the value of the 
dependent variable predicted by the model, y;, and the true value of the 
dependent variable, y; One method of estimation is ordinary least squares. 
This method obtains parameter estimates that minimize the sum of squared 


residuals SSE, also sometimes denoted RSS; 
x 3 n 
ee SSE= X'e? 


Ans. Regression and log-linear models can be used to approximate the | 
given data. In (simple) linear regression, the data are modeled to fit a straight 
line. For example, a random variable, y (called a response variable), can be | 
modeled as a linear function of another random variable, x (called a predictor 
variable), with the equation 
y= wxtb 
where, the variance of y is assumed to be constant. In data mining, x and y are 
numerical database attributes. The coefficients, w and b are called regression 
coefficients, they specify the slope of the line and the Y-intercept, respectively. 
These coefficients can be solved for by the method of least squares, which | 
minimizes the error between the actual line separating the data and the estimate | 
of the line. Multiple linear regression is an extension of (simple) linear 

regression, which allows a response variable, y, to be modeled as a linear 
function of two or more predictor variables. 

Log-linear models approximate discrete multidimensional probability 

distributions, Given a set of tuples in n dimensions (e.g., described by 1 


error term and the subscript i indexes a particular 
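As a small worked illustration of ordinary least squares for the straight-line case, the sketch below computes the intercept and slope from the closed-form expressions (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope · x̄) and then the sum of squared residuals. The function names `fit_line` and `sse` and the sample data are assumptions made only for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared residuals e_i = y_i - (b0 + b1 * x_i)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
xs = [0, 1, 2]
ys = [1, 3, 2]
b0, b1 = fit_line(xs, ys)
residual_sum = sse(xs, ys, b0, b1)
# With n data points and one regressor plus an intercept, the error
# variance estimate divides SSE by n - 2.
mse = residual_sum / (len(xs) - 2)
print(b0, b1, residual_sum, mse)
```

For these three points the means are x̄ = 1 and ȳ = 2, so the slope is 1/2 and the intercept is 1.5; the residuals are −0.5, 1 and −0.5, giving SSE = 1.5.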
Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$.

In the case of simple regression, the formulas for the least squares estimates are

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ is the mean (average) of the x values and $\bar{y}$ is the mean of the y values.

Under the assumption that the population error term has a constant variance, the estimate of that variance is given by –

$$\hat{\sigma}_\varepsilon^2 = \frac{SSE}{n - 2}$$

This is called the mean square error (MSE) of the regression. The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n − p) for p regressors or (n − p − 1) if an intercept is used. In this case p = 1 so the denominator is n − 2.

The standard errors of the parameter estimates are given by

$$\hat{\sigma}_{\beta_0} = \hat{\sigma}_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}, \qquad \hat{\sigma}_{\beta_1} = \hat{\sigma}_\varepsilon \sqrt{\frac{1}{\sum (x_i - \bar{x})^2}}$$

Fig. 4.4 Illustration of Linear Regression on a Data Set

Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters.

Q.9. Why is naive Bayesian classification called Naive ?

Ans. Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and in this sense is called Naive.

Q.10. What is Bayes' theorem ? Describe basic probability notation. How are these probabilities estimated ?

Ans. Let X be a data tuple. In Bayesian terms, X is considered "evidence". As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, respectively, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.

P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.

P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H) and P(X). Bayes' theorem is

$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$

Q.11. Explain working of naive Bayesian classifier in detail.
Or
Discuss the concept of naive Bayes method for classifying data tuples. (R.G.P.V., Nov. 2019)
Ans. The naive Bayesian classifier, or simple Bayesian classifier, works as follows –

(i) Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, $X = (x_1, x_2, \ldots, x_n)$, depicting n measurements made on the tuple from n attributes, respectively, $A_1, A_2, \ldots, A_n$.

(ii) Suppose that there are m classes, $C_1, C_2, \ldots, C_m$. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier predicts that tuple X belongs to the class $C_i$ if and only if

$$P(C_i|X) > P(C_j|X) \quad \text{for } 1 \le j \le m,\ j \ne i$$

Thus we maximize $P(C_i|X)$. The class $C_i$ for which $P(C_i|X)$ is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,

$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$

(iii) As P(X) is constant for all classes, only $P(X|C_i)\,P(C_i)$ need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, $P(C_1) = P(C_2) = \ldots = P(C_m)$, and we would therefore maximize $P(X|C_i)$. Otherwise, we maximize $P(X|C_i)\,P(C_i)$.

(iv) Given data sets with many attributes, it would be extremely computationally expensive to compute $P(X|C_i)$. In order to reduce computation in evaluating $P(X|C_i)$, the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus,

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \ldots \times P(x_n|C_i)$$

(v) In order to predict the class label of X, $P(X|C_i)\,P(C_i)$ is evaluated for each class $C_i$. The classifier predicts that the class label of tuple X is the class $C_i$ if and only if

$$P(X|C_i)\,P(C_i) > P(X|C_j)\,P(C_j) \quad \text{for } 1 \le j \le m,\ j \ne i$$

In other words, the predicted class label is the class $C_i$ for which $P(X|C_i)\,P(C_i)$ is the maximum.

Q.12. Why is Naive Bayesian classification called Naive ? Briefly outline the major ideas of Naive Bayesian classification. (R.G.P.V., Dec. 201...)

Ans. Refer to Q.9 and Q.11.

Q.13. Discuss Bayesian ... (R.G.P.V., May 2018)
Or
... Bayesian classification. (R.G.P.V., June 2013)
Or
Write short note on Bayesian ...

Ans. Refer to Q.5 (ii), Q.9 and Q.11.

Q.14. ... the advantages and disadvantages of naive Bayes ... C4.5 algorithm ? (R.G.P.V., Dec. 2010)

Ans. The naive Bayes approach has several advantages. Naive Bayes is simple and easy to understand. Other advantages of naive Bayes ... only one scan of the training data is required. The naive Bayes approach can easily handle missing values by simply omitting that probability when calculating the likelihoods of membership in each class.

Although the naive Bayes approach is straightforward to use, it does not always yield satisfactory results. First, the attributes usually are not independent. We could use a subset of the attributes by ignoring any that are dependent on others. The technique does not handle continuous data. Dividing the continuous values into ranges could be used to solve this problem, but the division of the domain into ranges is not an easy task and how this is done can certainly ... the model to incremental learning environments ... results.

Another disadvantage of naive Bayes is that it is limited to simplified models only, that in some cases are far from representing the complicated nature of the problem. To understand this weakness, consider a target attribute that cannot be explained by a single attribute, for instance, the Boolean exclusive or function (XOR).

Q.15. How is the zero frequency problem handled in naive Bayes classifier ? (R.G.P.V., June 2016)

Ans. When probability value becomes zero, then from naive Bayesian classification equation (i),

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \ldots \times P(x_n|C_i) \qquad \text{...(i)}$$

$P(X|C_i)$ is estimated as the product of the probabilities $P(x_1|C_i), P(x_2|C_i), \ldots, P(x_n|C_i)$, based on the assumption of class conditional independence. From the training tuples, these probabilities can be estimated. We need to calculate $P(X|C_i)$ for each class (i = 1, 2, ..., m) in order to find the class $C_i$ for which $P(X|C_i)\,P(C_i)$ is the maximum. Let's consider this calculation. For each attribute-value pair (i.e., $A_k = x_k$, for k = 1, 2, ..., n) in tuple X, it is required to count the number of tuples having that attribute-value pair, per class. For example, we include two classes, namely buys_computer = yes and buys_computer = no. Hence, for the attribute-value pair student = yes of X, say, we need two counts – the number of customers who are students and for which buys_computer = yes (which contributes to P(X|buys_computer = yes)) and the number of customers who are students and for which buys_computer = no (which contributes to P(X|buys_computer = no)). But what when, say, there are no training tuples representing students for the class buys_computer = no, resulting in P(student = yes|buys_computer = no) = 0 ?
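The sketch below puts the naive Bayes steps together and shows the zero-frequency effect on a tiny data set. The smoothing used here (adding a pseudo-count `alpha` to every count, often called the Laplacian correction) is a commonly used remedy and is an assumption on our part, since the answer above breaks off before naming one. The function names and the six training tuples are likewise hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes, per-class attribute values, and each attribute's domain."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    domains = defaultdict(set)           # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(label, k)][value] += 1
            domains[k].add(value)
    return class_counts, value_counts, domains


def class_scores(model, x, alpha=1):
    """Return P(X|Ci) * P(Ci) for every class; alpha=0 disables smoothing."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(Ci)
        for k, value in enumerate(x):
            # Smoothed estimate of P(x_k | Ci): add alpha to the count and
            # alpha * (number of distinct values of attribute k) to the total.
            score *= (value_counts[(c, k)][value] + alpha) / (
                n_c + alpha * len(domains[k])
            )
        scores[c] = score
    return scores


# Hypothetical tuples (student, credit_rating) with class buys_computer.
rows = [("yes", "fair"), ("yes", "excellent"), ("no", "fair"),
        ("yes", "fair"), ("no", "fair"), ("no", "excellent")]
labels = ["yes", "yes", "yes", "yes", "no", "no"]
model = train_naive_bayes(rows, labels)

x = ("yes", "excellent")
unsmoothed = class_scores(model, x, alpha=0)
smoothed = class_scores(model, x, alpha=1)
predicted = max(smoothed, key=smoothed.get)
print(unsmoothed, smoothed, predicted)
```

No student appears in class `no`, so without smoothing a single zero factor wipes out the whole product for that class. With `alpha=1` the factor becomes (0 + 1)/(2 + 2) = 1/4, and the class keeps a small but non-zero score.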


Q.16. What do you mean by distance based algorithms ?

Ans. Each item that is mapped to the same class may be thought of as more similar to the other items in that class than it is to the items found in other classes. Therefore distance measures may be used to identify the alikeness of different items in the database. Using a distance measure for classification where the classes are predefined is somewhat simpler than using a similarity (distance) measure for clustering where the classes are not known in advance.

Q.17. Write short note on k-nearest neighbor.

Ans. The k-nearest-neighbor method was first described in the early 1950s. The method is labor intensive when given large training sets and did not gain popularity until the 1960s when increased computing power became available. It has since been widely used in the area of pattern recognition.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space.

When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k "nearest neighbors" of the unknown tuple. "Closeness" is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$, is

$$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple $X_1$ and in tuple $X_2$, square this difference, and accumulate it. The square root is taken of the total accumulated distance count. We normalize the values of each attribute before using the above equation. Min-max normalization can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing

$$v' = \frac{v - min_A}{max_A - min_A}$$

where $min_A$ and $max_A$ are the minimum and maximum values of attribute A.

For k-nearest-neighbor classification, the unknown tuple is assigned the most common class among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space. Nearest-neighbor classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple.
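The description above can be sketched in a few lines: scale each attribute with min-max normalization, measure Euclidean distance, and take a majority vote among the k closest training tuples. The helper names and the (age, income) data are hypothetical, and sorting the whole training set is chosen for brevity rather than speed.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Scale every attribute to [0, 1]; also return a scaler for new tuples."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(row):
        # A constant attribute carries no information, so map it to 0.0.
        return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                     for v, lo, hi in zip(row, lows, highs))

    return [scale(row) for row in rows], scale


def euclidean(a, b):
    """Square root of the accumulated squared attribute differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_classify(train_rows, train_labels, query, k=3):
    """Assign the most common class among the k nearest training tuples."""
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (age, income) tuples with a buys_computer class label.
raw_rows = [(25, 20000), (30, 30000), (35, 40000),
            (45, 90000), (50, 100000), (55, 110000)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

train_rows, scale = min_max_normalize(raw_rows)
older_customer = knn_classify(train_rows, labels, scale((48, 95000)), k=3)
younger_customer = knn_classify(train_rows, labels, scale((28, 25000)), k=3)
print(older_customer, younger_customer)
```

Without the normalization step the income attribute, whose range is thousands of times wider than that of age, would dominate the distance almost completely.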


Q.18. Write an algorithm for k-nearest neighbour classification given k and n, the number of attributes describing each sample.

Ans. Let T represent the training data. Since each tuple to be classified must be compared to each element in the training data, if there are q elements in the training set, this is O(q). Given n elements to be classified, this becomes an O(nq) problem. Given that the training data are of a constant size, this can then be viewed as an O(n) problem.

Input –
    T    // Training data
    K    // Number of neighbors
    t    // Input tuple to classify

Output –
    c    // Class to which t is assigned

KNN Algorithm –
    // Algorithm to classify tuple using KNN
    N = ∅;
    // Find set of neighbors, N, for t
    for each d ∈ T do
        if |N| < K, then
            N = N ∪ {d};
        else
            if ∃ u ∈ N such that sim(t, u) < sim(t, d), then
            begin
                N = N − {u};
                N = N ∪ {d};
            end
    // Find class for classification
    c = class to which the most u ∈ N are classified;

Q.19. Explain the classification by decision tree induction. (R.G.P.V., Dec. ...)
Or
What is decision tree ? Explain how classification is done using decision tree induction. (R.G.P.V., June ...)
Or
Discuss about classification method using decision tree induction with example. (R.G.P.V., Dec. ...)

Ans. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

A decision tree is shown in fig. 4.5. It represents the concept buys_computer, that is, it predicts whether a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce nonbinary trees.

Fig. 4.5 Decision Tree for the Concept buys_computer

How are Decision Trees Used for Classification – Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.

The construction of decision tree classification does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle high dimensional data. Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifiers have good accuracy. However, successful use may depend on the data at hand. Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology. Decision trees are the basis of several commercial rule induction systems.

Q.20. Discuss in detail about the Bayesian and decision tree classifier. (R.G.P.V., June 2016)

Ans. Refer to Q.5 (ii) and Q.19.

Q.21. Write the advantages of decision tree.
Or
What are the advantages of decision tree induction ? (R.G.P.V., June 2017)

Ans. The strengths of the decision tree methods are the following –

(i) Decision trees are able to generate understandable rules.

(ii) They are able to handle both numerical and the categorical attributes.

(iii) They provide a clear indication of which fields are most important for prediction or classification.

Q.22. Write the disadvantages of decision tree.

Ans. Some of the weaknesses of the decision trees are as follows –

(i) Some decision trees can only deal with binary-valued target classes. Others are able to assign records to an arbitrary number of classes, but are error-prone when the number of training examples per class gets small. This can happen rather quickly in a tree with many levels and/or many branches per node.

(ii) The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field is examined before its best split can be found.

Q.23. Describe the basic algorithm for inducing a decision tree from training tuples.
Or
Write decision tree induction algorithm for classification. (R.G.P.V., Dec. 2008)
Or
Explain the algorithm for constructing a decision tree from training samples. (R.G.P.V., May 2019)

Ans. J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that attribute. At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root. Entropy is used to measure how informative a node is. This algorithm uses the criterion of information gain to determine the goodness of a split. The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute. A basic decision tree algorithm is summarized as follows –
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Basic Algorithm for Inducing a Decision Tree 
Step (i) create a node N;
Step (ii) if tuples in D are all of the same class, C, then
Step (iii)     return N as a leaf node labeled with the class C;
Step (iv) if attribute_list is empty then
Step (v)     return N as a leaf node labeled with the majority class in D; // majority voting
Step (vi) apply Attribute_selection_method (D, attribute_list) to find the "best" splitting_criterion;
Step (vii) label node N with splitting_criterion;
Step (viii) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
Step (ix)     attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute
Step (x) for each outcome j of splitting_criterion
             // partition the tuples and grow subtrees for each partition
Step (xi)     let Dj be the set of data tuples in D satisfying outcome j; // a partition
Step (xii)    if Dj is empty then
Step (xiii)       attach a leaf labeled with the majority class in D to node N;
Step (xiv)    else attach the node returned by Generate_decision_tree (Dj, attribute_list) to node N;
         endfor
Step (xv) return N;

At first glance, the algorithm may appear long, but it is quite straightforward. The strategy is as follows –

(i) The algorithm is called with three parameters – D, attribute_list and Attribute_selection_method. D is a data partition. Initially, it is the complete set of training tuples and their associated class labels. The parameter attribute_list is a list of attributes describing the tuples. Attribute_selection_method specifies a heuristic procedure for selecting the attribute that "best" discriminates the given tuples according to class. This procedure employs an attribute selection measure, such as information gain or gini index.

(ii) The tree starts as a single node, N, representing the training tuples in D [step (i)].

(iii) If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class [steps (ii) and (iii)]. Steps (iv) and (v) are terminating conditions.

(iv) Otherwise, the algorithm calls Attribute_selection_method to determine the splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes [step (vi)]. The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of the chosen test.

(v) The node N is labeled with the splitting criterion, which serves as a test at the node [step (vii)]. A branch is grown from node N for each of the outcomes of the splitting criterion. The tuples in D are partitioned accordingly [steps (x) and (xi)]. There are three possible scenarios, as illustrated in fig. 4.6. Let A be the splitting attribute. A has v distinct values, $\{a_1, a_2, \ldots, a_v\}$, based on the training data.

(a) A is Discrete-valued

In this case, the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value, $a_j$, of A and labeled with that value [fig. 4.6 (a)]. Partition $D_j$ is the subset of class-labeled tuples in D having value $a_j$ of A. Because all of the tuples in a given partition have the same value for A, then A need not be considered in any future partitioning of the tuples. Therefore, it is removed from attribute_list [steps (viii) and (ix)].

Fig. 4.6 Three Possibilities for Partitioning Tuples Based on the Splitting Criterion
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(b) A is Continuous-valued 


In this case, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, respectively, where split_point is the split-point returned by Attribute_selection_method as part of the splitting criterion. Two branches are grown from N and labeled according to the above outcomes [fig. 4.6 (b)]. The tuples are partitioned such that $D_1$ holds the subset of class-labeled tuples in D for which A ≤ split_point, while $D_2$ holds the rest.

(c) A is Discrete-valued and a Binary Tree must be Produced

The test at node N is of the form "A ∈ $S_A$ ?". $S_A$ is the splitting subset for A, returned by Attribute_selection_method as part of the splitting criterion. It is a subset of the known values of A. If a given tuple has value $a_j$ of A and if $a_j \in S_A$, then the test at node N is satisfied. Two branches are grown from N [fig. 4.6 (c)]. By convention, the left branch out of N is labeled yes so that $D_1$ corresponds to the subset of class-labeled tuples in D that satisfy the test. The right branch out of N is labeled no so that $D_2$ corresponds to the subset of class-labeled tuples from D that do not satisfy the test.

(vi) The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, $D_j$, of D [step (xiv)].

(vii) The recursive partitioning stops only when any one of the following conditions is true –

(a) All the tuples in partition D belong to the same class [steps (ii) and (iii)].

(b) There are no remaining attributes on which the tuples may be further partitioned [step (iv)]. In this case, majority voting is employed [step (v)]. This involves converting node N into a leaf and labeling it with the most common class in D. Alternatively, the class distribution of the node tuples may be stored.

(c) There are no tuples for a given branch, that is, a partition $D_j$ is empty [step (xii)]. In this case, a leaf is created with the majority class in D [step (xiii)].

(viii) The resulting decision tree is returned [step (xv)].
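The recursion can be sketched compactly for discrete-valued attributes with multiway splits, using information gain as the attribute selection measure. This is an illustrative sketch under those assumptions, not a full implementation of the algorithm: the tree is stored as nested dictionaries, the function names and training tuples are hypothetical, and split points and splitting subsets are not handled.

```python
import math
from collections import Counter


def entropy(labels):
    """Information I = sum(-p_i * log2(p_i)) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of D minus the weighted entropy of the partitions on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def generate_decision_tree(rows, labels, attributes):
    if len(set(labels)) == 1:        # all tuples in one class: make a leaf
        return labels[0]
    if not attributes:               # no attributes left: majority voting
        return majority(labels)
    # Pick the attribute with the greatest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    # "default" stores the majority class for values never seen in training.
    # Only observed values are branched on, so no partition is empty here.
    node = {"attribute": best, "branches": {}, "default": majority(labels)}
    for value in set(row[best] for row in rows):
        part = [(row, lab) for row, lab in zip(rows, labels) if row[best] == value]
        node["branches"][value] = generate_decision_tree(
            [row for row, _ in part], [lab for _, lab in part], remaining)
    return node


def classify(tree, row):
    """Trace a path from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree


# Hypothetical training tuples for the concept buys_computer.
rows = [{"student": "no", "credit": "fair"},
        {"student": "no", "credit": "excellent"},
        {"student": "yes", "credit": "fair"},
        {"student": "yes", "credit": "excellent"}]
labels = ["no", "no", "yes", "yes"]
tree = generate_decision_tree(rows, labels, ["student", "credit"])
print(tree)
print(classify(tree, {"student": "yes", "credit": "fair"}))
```

In this data the attribute `student` separates the classes perfectly (gain of 1 bit) while `credit` gives no gain, so the root test is on `student` and both branches end in leaves.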
The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D. This means that the computational cost of growing a tree grows at most n × |D| × log(|D|) with |D| tuples.

Q.24. Write an algorithm for decision tree induction. Give important characteristics of decision tree induction algorithms. (R.G.P.V., June 2010, 2014)

Ans. Refer to Q.23.

Characteristics of Decision Tree Induction Algorithm – The following are some important characteristics of decision tree induction algorithms –

(i) Decision tree induction is a nonparametric approach for building classification models. In other words, it does not require any prior assumptions regarding the type of probability distributions satisfied by the class and other attributes.

(ii) Finding an optimal decision tree is an NP-complete problem.

(iii) Techniques developed for constructing decision trees are computationally inexpensive, making it possible to quickly construct models even when the training set size is very large. Furthermore, once a decision tree has been built, classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree.

(iv) Decision trees, especially smaller-sized trees, are relatively easy to interpret. The accuracies of the trees are also comparable to other classification techniques for many simple data sets.

(v) Decision trees provide an expressive representation for learning discrete-valued functions. However, they do not generalize well to certain types of Boolean problems. One notable example is the parity function, whose value is 0 (1) when there is an odd (even) number of Boolean attributes with the value True. Accurate modeling of such a function requires a full decision tree with $2^d$ nodes, where d is the number of Boolean attributes.

(vi) Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.

(vii) The presence of redundant attributes does not adversely affect the accuracy of decision trees. An attribute is redundant if it is strongly correlated with another attribute in the data. One of the two redundant attributes will not be used for splitting once the other attribute has been chosen. However, if the data set contains many irrelevant attributes, i.e., attributes that are not useful for the classification task, then some of the irrelevant attributes may be accidentally chosen during the tree-growing process, which results in a decision tree that is larger than necessary. Feature selection techniques can help to improve the accuracy of decision trees by eliminating the irrelevant attributes during preprocessing.

(viii) Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number of records becomes smaller as we traverse down the tree. At the leaf nodes, the number of records may be too small to make a statistically significant decision about the class representation of the nodes. This is known as the data fragmentation problem. One possible solution is to disallow further splitting when the number of records falls below a certain threshold.

(ix) A subtree can be replicated multiple times in a decision tree, as shown in fig. 4.7. This makes the decision tree more complex than necessary and perhaps more difficult to interpret. Such a situation can arise from decision tree implementations that rely on a single attribute test condition at each internal node. Since most of the decision tree algorithms use a divide-and-conquer partitioning strategy, the same test condition can be applied to different parts of the attribute space, thus leading to the subtree replication problem.

Fig. 4.7 Tree Replication Problem. The Same Subtree can Appear at Different Branches
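Characteristic (v) can be checked numerically. The sketch below, written under our own assumptions (three Boolean attributes, entropy-based information gain as the splitting measure, hypothetical function names), builds the truth table of the parity function and shows that no single attribute yields any information gain, which is why a greedy single-attribute split has nothing to go on and the full tree is needed.

```python
import math
from collections import Counter
from itertools import product


def entropy(labels):
    """Information I = sum(-p_i * log2(p_i)) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, k):
    """Gain from splitting on the attribute at position k of each tuple."""
    n = len(rows)
    remainder = 0.0
    for value in set(row[k] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[k] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


d = 3  # number of Boolean attributes (an arbitrary small choice)
rows = list(product([0, 1], repeat=d))
# Parity as stated in the text: 0 for an odd number of True attributes,
# 1 for an even number.
labels = [0 if sum(row) % 2 else 1 for row in rows]

gains = [information_gain(rows, labels, k) for k in range(d)]


def flip(row, k):
    """Return a copy of row with attribute k inverted."""
    return row[:k] + (1 - row[k],) + row[k + 1:]


# Flipping any single attribute always flips the class, so every one of the
# 2**d attribute combinations has to be told apart by the tree.
label_of = dict(zip(rows, labels))
every_flip_changes_class = all(
    label_of[flip(row, k)] != label_of[row] for row in rows for k in range(d)
)
print(gains, len(rows), every_flip_changes_class)
```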
Q.25. What is classification ? Discuss any one method of decision tree induction. (R.G.P.V., Nov./Dec. ...)

Ans. Classification – Refer to Q.3.

Method of Decision Tree Induction – CART (Classification And Regression Tree) is one of the popular methods of building decision trees in the machine learning community. CART builds a binary decision tree by splitting the records at each node, according to a function of a single attribute. CART uses the gini index for determining the best split. CART follows the principle of constructing the decision tree. We outline the method for the sake of completeness.

The initial split produces two nodes, each of which we now attempt to split in the same manner as the root node. Once again, we examine all the input fields to find the candidate splitters. If no split can be found that significantly decreases the diversity of a given node, we label it as a leaf node. Eventually, only leaf nodes remain and we have grown the full decision tree. The full tree may generally not be the tree that does the best job of classifying a new set of records, because of overfitting.

At the end of the tree-growing process, every record of the training set has been assigned to some leaf of the full decision tree. Each leaf can now be assigned a class and an error rate. The error rate of a leaf node is the percentage of incorrect classification at that node. The error rate of an entire decision tree is a weighted sum of the error rates of all the leaves. Each leaf's contribution to the total is the error rate at that leaf multiplied by the probability that a record will end up in there.
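Two of the calculations mentioned above can be sketched briefly: choosing a binary split on a numeric attribute by the gini index, and combining leaf error rates into a tree error rate. The gini formula used, 1 − Σ pᵢ², is the standard definition and is assumed here because the answer does not spell it out. The function names, the candidate-threshold rule (midpoints between consecutive distinct values) and the sample figures are illustrative assumptions, not CART's exact procedure.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_of_split(left, right):
    """Weighted gini index of the two child nodes of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_numeric_split(values, labels):
    """Try midpoints between consecutive distinct values; keep the lowest gini."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = gini_of_split(left, right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


def tree_error_rate(leaves):
    """leaves holds (records_in_leaf, misclassified_in_leaf) pairs.

    Each leaf contributes its error rate times the share of records
    that end up in it.
    """
    total = sum(n for n, _ in leaves)
    return sum((n / total) * (wrong / n) for n, wrong in leaves)


# Hypothetical ages and class labels.
ages = [20, 25, 30, 40, 50, 60]
classes = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_numeric_split(ages, classes)

# Two hypothetical leaves: 60 records with 6 errors, 40 records with 10.
overall_error = tree_error_rate([(60, 6), (40, 10)])
print(threshold, score, overall_error)
```

For the two leaves the error rates are 10% and 25%, weighted by 0.6 and 0.4, so the tree's error rate is 0.06 + 0.10 = 0.16.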


Q.26. Why is decision tree induction popular ? Discuss overfitting of decision tree and two approaches to avoid overfitting using suitable example. (R.G.P.V., June 2015)

Ans. The construction of decision tree classifiers does not need any domain knowledge or parameter setting and thus is appropriate for exploratory knowledge discovery. High dimensional data can be handled by decision trees. Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. The learning and classification steps of decision tree induction are simple and fast. Decision tree classifiers have good accuracy. However, successful use may depend on the data at hand.

In data mining, overfitting refers to the situation in which the induction algorithm generates a classifier which perfectly fits the training data but has lost the capability of generalizing to instances not presented during training.

0.27. Explain split algorithm based on information theory. 
‘ Or 


Explain the method for computing best split. (R.GP.V., May 201 9) 
Ans. One of the methods for choosing an attribute to split a node is based 


‘nthe concept of information theory or entropy. The concept is very simple, 


jowever difficult to understand for many. The concept is based on Claude 
shannon’s idea that when you have uncertainty then you have information and 
yhen there is no uncertainty there is no information. For example, when coin 
jys a tail on both sides, then the result of tossing it does not generate any 
formation but when a coin is normal with a head and a tail then the result of 
the toss gives information. 

Essentially information is defined as $-p_i \log p_i$ where $p_i$ represents the
probability of some event. Because the probability $p_i$ is always less than one,
$\log p_i$ is always negative and $-p_i \log p_i$ is always positive. For those who
cannot remember their high school mathematics, we note that log of one is always zero
whatever the base, the log of any number greater than one is always positive and
the log of any number smaller than one is always negative. Also,

$$\log_2\left(\tfrac{1}{2}\right) = -1$$
$$\log_2\left(\tfrac{1}{2^n}\right) = -n$$

Information of any event which has several possible outcomes is given
by the following equation:
$$I = \sum_i (-p_i \log p_i)$$
Assume an event that can have one of two possible values. Let the
probabilities of the two values be $p_1$ and $p_2$. When $p_1$ is one and $p_2$ is zero,
there is no information in the outcome and $I = 0$. When $p_1 = 0.5$ and $p_2 = 0.5$,
then the information is
$$I = -0.5 \log(0.5) - 0.5 \log(0.5)$$
This comes out to 1.0 (using log base 2) and is the maximum information
which you can have for an event with two possible outcomes. This is also
known as entropy and is in effect a measure of the minimum number of bits
needed to encode the information. When we consider the case of a die with six
possible outcomes with equal probability, then the information is computed by
$$I = 6\left(-\tfrac{1}{6} \log \tfrac{1}{6}\right) = 2.585$$
Hence, three bits are needed to show the outcome of rolling a die. When
the die is loaded so that there is a 50% or a 75% chance of getting a six, then
the information content of rolling the die would be lower as given below. Note
that we consider that the probability of getting any of 1 to 5 is equal (that is
equal to 10% for the 50% condition and 5% for the 75% condition).
(50%) $I = 5(-(0.1) \log(0.1)) - 0.5 \log(0.5) = 2.16$
(75%) $I = 5(-(0.05) \log(0.05)) - 0.75 \log(0.75) = 1.39$
Thus, three bits are required to show the outcome of throwing a die that has
50% probability of throwing a six but only two bits if the probability is 75%.
The idea of choosing a split attribute is to choose an attribute which
decreases the uncertainty by the largest amount. So the attribute ought to
distribute the objects such that each attribute value results in objects that have
as little uncertainty as possible. Ideally, each attribute value should provide us
with objects which relate to only one class and therefore have zero information.


Q.28. Explain split algorithm based on Gini index.


Ans. Gini index was proposed by Italian economist Corrado Gini as a
measure of resource inequality in a population. The index varies from 0 to 1,
one means maximum possible inequality and zero means no inequality. The
index is based on the Lorenz curve. This Lorenz curve plots cumulative family
income against the number of families from poorest to richest. The Lorenz curve
used for the index is represented in fig. 4.8. The ratio of the area between the
Lorenz curve and the 45 degree line to the area under the 45 degree line is known
as the Gini index. The smaller the ratio, the less is the area between the two
curves and the more evenly distributed is the wealth. If wealth is evenly
distributed, asking any person about his/her wealth gives no information at all
because every person has the same wealth, while in a situation where wealth is
very unevenly distributed, finding out how much wealth a person has gives
information because of the uncertainty of wealth distribution.

Now, we discuss how the Lorenz curve is obtained. The equal distribution of
wealth would look like the 45 degree line (10% of the people have 10% of the
wealth, 20% of the people have 20% of the wealth, and so on), and the actual
wealth distribution is plotted against it. Wealth may be calculated in many
different ways; here it is based on the total household wealth. The poorer half
of the Australian population has very little wealth (only 6%): most of them do
not own a home and they often have credit card debts. This is in contrast to 45%
of the nation's wealth being held by the richer 10% of the population.

Table 4.1 Representation of Wealth Distribution in Australia

Fig. 4.8 Representation of Lorenz Curve


[Figure axes: actual fraction of wealth in Australia (cumulative) and fraction of
wealth if equal distribution (cumulative)]

The area under the 45 degree line is the area of the triangle, which is 0.5. The
ratio of the area between the 45 degree line and the Lorenz curve to the area
below the 45 degree line is therefore $(0.5 - A)/0.5$, where $A$ is the area under
the Lorenz curve. A more accurate value (0.35) is shown in table 4.2.

We now, in table 4.2 using figures from the United Nations Human
Development Report 2004, show that there is wide variation in family income
distribution in the countries of the world. Note that the income distribution
and wealth distribution are not the same. Usually, the wealth distribution is
worse than income distribution. Again, note that a Gini index of 1 means
unequal distribution and zero means equal distribution. Hence, income
distribution is best in Sweden and Japan and worst in Brazil. Note that
income distribution in USA is even worse than in countries like Indonesia and
India which are sometimes used as an example of uneven income distribution.

Table 4.2 Representation of Income Distribution Gini Index for a Number of Countries

India
Sweden, Japan
Brazil
Mexico
Malaysia
Russia
China
USA
UK
Australia

The Gini index can be computed for the whole world because the population and
GNP per capita of each country in the world is known. When the Gini index within
each country is taken to be zero, the Gini index for the world for 2003 was
[value illegible]. However, when the Gini index within each country is taken to be
0.40, the Gini index for the world goes up to about 0.77. Perhaps not surprisingly,
this shows that income within the world is very unevenly distributed and wealth is
even more unevenly distributed. If countries like China, Brazil and India continue
to grow at the current fast pace, the world Gini index could well reduce
significantly in the next 20-30 years.

The Gini index has been used in a wide variety of applications in
addition to measuring wealth distribution. For example, the Queensland
government has analyzed crime data using the Gini index to determine
how distinct types of crimes are distributed in the state. The results obtained
are shown in table 4.3.

Table 4.3 Representation of Crime Distribution in Queensland and Gini Index

Drug offences
Unlawful entry shops
Stealing from homes

It is clear that crimes like stealing from homes are fairly evenly distributed
in the state because the Gini index for these types of crimes is small, while
crimes such as prostitution offences are, perhaps not surprisingly, not evenly
distributed because these types of crimes are more frequent in larger cities
than in smaller cities and towns.

When a dataset S has examples from n classes, the Gini index, G(S), is
specified as given below:
$$G(S) = 1 - \sum_{i=1}^{n} p_i^2$$
where $p_i$ represents the relative frequency of class i in S. When a dataset S
with s elements is split into two subsets $S_1$ and $S_2$ with number of elements
$n_1$ and $n_2$ respectively, the Gini index of the split data is specified as
given below:
$$G_{split}(S) = (n_1/s)\, G(S_1) + (n_2/s)\, G(S_2)$$
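The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `gini` and `gini_split` and the class labels in the sample data are hypothetical, not taken from the text.

```python
from collections import Counter


def gini(labels):
    """Gini index G(S) = 1 - sum of squared relative class frequencies."""
    s = len(labels)
    if s == 0:
        return 0.0
    return 1.0 - sum((count / s) ** 2 for count in Counter(labels).values())


def gini_split(subset1, subset2):
    """Gini index of a split: each subset's index weighted by its share of S."""
    s = len(subset1) + len(subset2)
    return (len(subset1) / s) * gini(subset1) + (len(subset2) / s) * gini(subset2)


# Hypothetical class labels for a dataset of six tuples.
labels = ["yes", "yes", "yes", "no", "no", "no"]
print(gini(labels))
# A split that separates the classes completely, and one that does not.
print(gini_split(["yes", "yes", "yes"], ["no", "no", "no"]))
print(gini_split(["yes", "yes", "no"], ["yes", "no", "no"]))
```

A split attribute is preferred when it gives the smallest split index, since that corresponds to subsets that are closest to containing a single class.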
For the outcome of tossing a true coin, the Gini index is calculated as given
below:
$$G = 1 - (0.5)^2 - (0.5)^2 = 0.5$$
For an event with two possible values, 0.5 is the maximum value.
Assume a loaded coin with 70% chance of a head and another coin
with heads on both sides.
$$G = 1 - (0.7)^2 - (0.3)^2 = 0.42$$
$$G = 1 - (1.0)^2 = 0$$
Clearly, there is no uncertainty about the coin with two heads and so the
index is zero. With heads having a probability of 70%, the uncertainty is lower
than with a coin with equal probabilities and thus the index is again lower.
The Gini index for a die with six possible outcomes with equal probability is
$$G = 1 - 6(1/6)^2 = 5/6 = 0.833$$
When a loaded die has much more chance of getting a six, say 50% or
even 75%, the index is given by
(50%) $G = 1 - 5(0.1)^2 - (0.5)^2 = 0.70$
(75%) $G = 1 - 5(0.05)^2 - (0.75)^2 = 0.425$
Clearly, the index is largest with the largest uncertainty. Table 4.4 lists the
two measures, information and the Gini index, for a number of events.

Table 4.4 Representation of Information and Gini Index for a Number of Events

Event                                        Information    Gini Index
Toss of a true coin                          1.0            0.5
Toss of a biased coin (70% heads)            0.881          0.42
Throw of a true die                          2.585          0.833
Throw of a biased die (50% chance of a 6)    2.16           0.70
Throw of a biased die (75% chance of a 6)    1.39           0.425

It should be noted that the maximum value the Gini index can have is
one. There is no such maximum value for information.
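The information and Gini index values for the coin and die events can be checked with a short sketch. The function names and the `events` dictionary are hypothetical names chosen for this illustration; logarithms are taken to base 2, as in the entropy calculation for a fair coin.

```python
import math


def information(probabilities):
    """I = sum of -p * log2(p); outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def gini(probabilities):
    """G = 1 - sum of squared outcome probabilities."""
    return 1.0 - sum(p * p for p in probabilities)


# Outcome probabilities for each event.
events = {
    "true coin": [0.5, 0.5],
    "biased coin (70% heads)": [0.7, 0.3],
    "true die": [1 / 6] * 6,
    "biased die (50% chance of a 6)": [0.1] * 5 + [0.5],
    "biased die (75% chance of a 6)": [0.05] * 5 + [0.75],
}

for name, probs in events.items():
    print(f"{name}: information = {information(probs):.3f}, Gini = {gini(probs):.3f}")
```

Both measures fall as one outcome becomes more likely, which is why either can be used to rank candidate splits.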


Q.29. Write short note on the pruning technique. (R.G.P.V., Dec. 20…)

Ans. The decision tree built using the training set deals correctly with
most of the records in the training set. This is inherent to the way it was
built. Overfitting is one of the unavoidable situations that may arise due to such
construction. Moreover, if the tree becomes very deep, lopsided or bushy, the
rules yielded by the tree become unmanageable and difficult to
comprehend. Therefore, a pruning phase is invoked after the construction to
arrive at a (sort of) optimal decision tree. This is known as the pruning technique.

The pruning of the decision tree is done by replacing a subtree by a leaf
node. The replacement takes place if a decision rule establishes that the expected
error rate in the subtree is greater than in the single leaf. There are two ways of
thinking in adopting pruning strategies. They are based on whether to use the
same training data set that is used for building the tree for pruning, or to use a
separate test data set for pruning. Both the training data set and test data set
are pre-classified data, that is, the class labels of each tuple are known for these
sets. When the known data is too small to set a portion aside, pruning is carried
out based on the training set. In case the known data is sufficiently large so as
to keep aside a portion of it for testing, the second approach is preferable. CART
uses the train-and-test approach for pruning. In the train-and-test approach, we
train on one set and test on another.

The reduced error pruning algorithm greedily prunes nodes whose removal does
not increase the error on the test set. The error is measured based on test data
classification. The post-pruning approach prunes after the rules are obtained
from the tree. It removes those preconditions whose removal improves the
estimated accuracy, and filters out poor rules.

Q.30. Explain tree pruning. How does tree pruning work ?

Ans. When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers. Tree pruning methods
address this problem of overfitting the data. Such methods typically use
statistical measures to remove the least reliable branches. An unpruned tree and
a pruned version of it are shown in fig. 4.9. Pruned trees tend to be smaller and
less complex and, thus, easier to comprehend. They are usually faster and
better at correctly classifying independent test data than unpruned trees.

Fig. 4.9 An unpruned Decision Tree and a Pruned Tree

Working of Tree Pruning - There are two common approaches to tree
pruning:

(i) Prepruning - In the prepruning approach, a tree is "pruned" by
halting its construction early (e.g., by deciding not to further split or partition
the subset of training tuples at a given node). Upon halting, the node becomes
a leaf. The leaf may hold the most frequent class among the subset tuples or
the probability distribution of those tuples.
When constructing a tree, measures such as statistical significance,
information gain, Gini index, and so on can be used to assess the goodness of
a split. If partitioning the tuples at a node would result in a split that falls
below a prespecified threshold, then further partitioning of the given subset is
halted. There are difficulties, however, in choosing an appropriate threshold.
High thresholds could result in oversimplified trees, whereas low thresholds
could result in very little simplification.

(ii) Postpruning - The second and more common approach is
postpruning, which removes subtrees from a "fully grown" tree. A subtree at a
given node is pruned by removing its branches and replacing it with a leaf.
The leaf is labeled with the most frequent class among the subtree being
replaced. For example, notice the subtree at node "A3?" in the unpruned tree
of fig. 4.9. Suppose that the most common class within this subtree is "class
B". In the pruned version of the tree, the subtree in question is pruned by
replacing it with the leaf "class B".
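The postpruning step described above, replacing a subtree by a leaf labeled with the most frequent class, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the nested-dictionary tree layout, the attribute names, and the rule "prune when the leaf makes no more errors than the subtree on a separate pruning set" are all assumptions made for the example.

```python
from collections import Counter


def classify(tree, record):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][record[tree["attr"]]]
    return tree


def errors(tree, tuples):
    """Number of (record, label) pairs that the tree misclassifies."""
    return sum(1 for record, label in tuples if classify(tree, record) != label)


def postprune(tree, tuples):
    """Bottom-up: replace a subtree by a majority-class leaf when the leaf
    makes no more errors on the pruning tuples than the subtree does."""
    if not isinstance(tree, dict) or not tuples:
        return tree
    attr = tree["attr"]
    pruned = {"attr": attr, "branches": {}}
    for value, subtree in tree["branches"].items():
        subset = [(r, c) for r, c in tuples if r[attr] == value]
        pruned["branches"][value] = postprune(subtree, subset)
    majority = Counter(label for _, label in tuples).most_common(1)[0][0]
    if errors(majority, tuples) <= errors(pruned, tuples):
        return majority
    return pruned


# Hypothetical unpruned tree: the subtree at "A3" was fitted to noisy data.
tree = {"attr": "A1", "branches": {
    "yes": {"attr": "A2", "branches": {"yes": "class A", "no": "class B"}},
    "no": {"attr": "A3", "branches": {"yes": "class B", "no": "class A"}},
}}
# Hypothetical pruning set, kept separate from the training data.
pruning_set = [
    ({"A1": "yes", "A2": "yes", "A3": "no"}, "class A"),
    ({"A1": "yes", "A2": "no", "A3": "no"}, "class B"),
    ({"A1": "no", "A2": "no", "A3": "yes"}, "class B"),
    ({"A1": "no", "A2": "no", "A3": "no"}, "class B"),
    ({"A1": "no", "A2": "yes", "A3": "no"}, "class B"),
]
print(postprune(tree, pruning_set))
```

In this example the subtree at "A3" is replaced by the leaf "class B", while the subtree at "A2" is kept because replacing it would add an error.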


The cost complexity pruning algorithm used in CART is an example of
the postpruning approach. This approach considers the cost complexity of a
tree to be a function of the number of leaves in the tree and the error rate of the
tree. It starts from the bottom of the tree. For each internal node, N, it computes
the cost complexity of the subtree at N, and the cost complexity of the subtree
at node N if it were to be pruned. The two values are compared. If pruning the
subtree at node N would result in a smaller cost complexity, then the subtree is
pruned. Otherwise, it is kept. A pruning set of class-labeled tuples is used to
estimate cost complexity. This set is independent of the training set used to build
the unpruned tree and of any test set used for accuracy estimation. The
algorithm generates a set of progressively pruned trees. In general, the smallest
decision tree that minimizes the cost complexity is preferred.

Q.31. What are the two approaches to tree pruning ? (R.G.P.V., June)

Ans. Refer to Q.30.

Q.32. Explain about the following -
(i) Algorithm for generating decision tree
(ii) Pre-pruning and Post-pruning approach.
(R.G.P.V., June 20…)

Ans. (i) Algorithm for Generating Decision Tree - Refer to Q.23.
(ii) Pre-pruning and Post-pruning Approach - Refer to Q.30.

Q.33. How to build a rule based classifier by extracting IF-THEN rules
from a decision tree ?

Ans. To extract rules from a decision tree, one rule is created for each
path from the root to a leaf node. Each splitting criterion along a given path is
logically ANDed to form the rule antecedent ("IF" part). The leaf node holds
the class prediction, forming the rule consequent ("THEN" part).

A disjunction (logical OR) is implied between each of the extracted rules.
Because the rules are extracted directly from the tree, they are mutually
exclusive and exhaustive. By mutually exclusive, this means that we cannot
have rule conflicts here because no two rules will be triggered for the same
tuple. By exhaustive, there is one rule for each possible attribute-value
combination, so that this set of rules does not require a default rule. Therefore,
the order of the rules does not matter: they are unordered.

Since we end up with one rule per leaf, the set of extracted rules is not
much simpler than the corresponding decision tree. The extracted rules may
be even more difficult to interpret than the original trees in some cases. The
resulting set of rules extracted can be large and difficult to follow, because
some of the attribute tests may be irrelevant or redundant. So, the plot thickens.
Although it is easy to extract rules from a decision tree, we may need to do
some more work by pruning the resulting rule set.

For a given rule antecedent, any condition that does not improve the
estimated accuracy of the rule can be pruned (i.e., removed), thereby generalizing
the rule. C4.5 extracts rules from an unpruned tree, and then prunes the rules using
a pessimistic approach similar to its tree pruning method. The training tuples
and their associated class labels are used to estimate rule accuracy. However,
because this would result in an optimistic estimate, the estimate is adjusted to
compensate for the bias, resulting in a pessimistic estimate. In addition, any rule
that does not contribute to the overall accuracy of the entire rule set can also be
pruned.

NEURAL NETWORK BASED ALGORITHMS, RULE BASED
ALGORITHMS, PROBABILISTIC CLASSIFIERS

Q.34. Write a short note on neural networks.

Ans. Neural networks are large networks of simple processing elements
or nodes which process information dynamically in response to external inputs.
The nodes are simplified forms of neurons.

The knowledge in a neural network is distributed throughout the network
in the form of internode connections and weighted links which form the inputs
to the nodes.
The link weights serve to enhance or inhibit the input stimuli values which
are then added together at the nodes.
If the sum of all the inputs to a node exceeds some threshold value T, the
node executes and produces an output which is passed on to other nodes or is
used to produce some output response.
In the simplest case, no output is produced if the total input is less than T. In
more complex models, the output will depend on a nonlinear activation function.

Fig. 4.10 Model of a Single Neuron

A single node is shown in fig. 4.10. The input values to the node are $x_1$,
$x_2$, ..., $x_n$, which typically take on values of -1, 0, 1, or real values within the
range (-1, 1). The weights $w_1$, $w_2$, ..., $w_n$ correspond to the synaptic strengths
of a neuron. They serve to increase or decrease the effects of the corresponding
$x_i$ input values. The sum of the products $x_i \times w_i$, i = 1, 2, ..., n serves as the
total combined input to the node. If this sum is large enough to exceed the
threshold amount T, the node fires, and produces an output y, an activation
function value placed on the node's output links. This output may then be the
input to other nodes or the final output response from the network.

Fig. 4.11 shows three layers of a number of interconnected nodes. The
first layer serves as the input layer, receiving inputs from some set of stimuli.
The second layer (called the hidden layer) receives inputs from the first layer
and produces a pattern of inputs to the third layer, the output layer.

Fig. 4.11 A Multilayer Neural Network

The patterns of output from the final layer are the network's responses to
the input stimuli patterns. Input links to layer j (j = 1, 2, 3) have weights $w_{ij}$
for i = 1, 2, ..., n.
General multilayer networks having n nodes (number of rows) in each of
m layers (number of columns of nodes) will have weights represented as an
n x m matrix W. Using this representation, nodes having no interconnecting
links will have a weight value of zero.

Q.35. Give the characteristics, benefits and uses of neural network.

Ans. Characteristics - Some features of biological neural networks that
make them superior to even the most sophisticated AI computer system are as
follows:
(i) Flexibility - The network automatically adjusts to a new
environment without using any preprogrammed instructions.
(ii) Robustness and Fault Tolerance - The decay of nerve cells does
not seem to affect the performance significantly.
(iii) Ability to Deal with a Variety of Data Situations - The network
can deal with information that is fuzzy, probabilistic, noisy and inconsistent.
(iv) Collective Computation - The network performs routinely many
operations in parallel and also a given task in a distributed manner.

Benefits - Some benefits of neural network are as follows:
(i) The mathematical foundation of a neural network does not need
expertise in dynamic programming or linear algebra, beyond the basic gradient
descent algorithm.
(ii) A neural network can perform tasks that a linear algorithm cannot.
(iii) An MLP is usually reliable for highly dynamic and nonlinear
processes.

Uses - Neural networks have good scope of being used in areas such as
recognition tasks, image processing, data compression, weather prediction,
signature verification, robot control, etc.

Q.36. How does a neural network work ? Explain.

Ans. A neural network can be thought of as a black box that transforms
the input vector x to the output vector y, where the transformation performed
is the result of the pattern of connections and weights, that is, according to the
values of the weight matrix W.
Consider the vector product
$$x * w = \sum_i x_i w_i$$
There is a geometric interpretation for this product. It is equivalent to
projecting one vector onto the other vector in n-dimensional space. This notion
is depicted in fig. 4.12 for the two-dimensional case.

Fig. 4.12 Vector Multiplication is like Vector Projection

The magnitude of the resultant vector is given by
$$x * w = |x|\,|w| \cos\theta$$
where |x| denotes the norm or length of the vector x. Note that this product is
maximum when both vectors point in the same direction, that is when θ = 0.
The product is a minimum when both point in opposite directions or when
θ = 180°.

This illustrates how the vectors in the weight matrix W influence the
inputs to the nodes in a neural network.

Q.37. Find all possible stable states of the neural network.

Fig. 4.13

Ans. The above problem is an example of a simple Hopfield net, in which
processing elements, or units, are always in one of the two states, active or
inactive. In the given figure, units filled with black color are active and units
which are blank are inactive. Units are connected to each other with weighted
connections. A positively weighted connection shows that the two units tend
useful for understanding, modeling and treating speech and hearing
impairments. Hearing-aids can well be improved by using neural networks for
noise filtering and optimization of parameter settings.

Table 4.5 shows some other applications of neural networks in medical field.

Table 4.5 Applications of Neural Networks in Medical Field

Discipline                       Application Field
Cardiology                       Diagnostics, Prognostics
ECG                              Diagnostics, Prognostics
Intensive care                   Prediction
Gastroenterology                 Prediction
Pulmonology                      Diagnostics
Oncology                         Diagnostics, Prognostics
Paediatrics                      Diagnostics
Neurology                        Signal processing, Modelling
EEG                              Diagnostics
Otology-Rhinology-Laryngology    Signal processing, Modelling
Obstetrics and Gynaecology       Prediction
Ophthalmology                    Signal processing, Modelling
Radiology                        Signal processing (X-ray, US, CT)
Clinical chemistry               Signal processing, Diagnostics
Pathology                        Diagnostics, Prognostics
Cytology                         Diagnostics, Re-screening
Genetics                         Diagnostics
Biochemistry                     Protein sequence, …

Q.41. Present a taxonomy of neural networks.

Ans. Many neural network systems have evolved over the years. Some of
the systems which have acquired significance are as follows:
(i) Perceptron
(ii) ADALINE (Adaptive Linear Neural Element)
(iii) MADALINE (Many ADALINE)
(iv) Neocognitron
(v) Hopfield network
(vi) Hamming network
(vii) Cauchy machine
(viii) RBF (Radial Basis Function)
(ix) ART (Adaptive Resonance Theory)
(x) AM (Associative Memory)
(xi) Boltzmann machine
(xii) SOFM (Self Organizing Feature Map)
(xiii) BAM (Bidirectional Associative Memory)
(xiv) CPN (Counter Propagation Network)
(xv) RNN (Recurrent Neural Network)
(xvi) BSB (Brain-state-in-a-box)
(xvii) CCN (Cascade Correlation)
(xviii) LVQ (Learning Vector Quantization)

[Table: neural network systems arranged by type of architecture and learning
method. Recoverable entries: single-layer feedforward - ADALINE, Hopfield,
Perceptron (gradient descent), AM, Hopfield; multilayer feedforward - CCN, MLFF,
RBF (gradient descent), Neocognitron; remaining entries - BSB, Hopfield,
Boltzmann machine, Cauchy machine.]

Taxonomy of ANN - Fig. 4.17 shows taxonomy of ANN.

Neural Net Classifiers for Fixed Patterns
    Binary Input
        Supervised Training: Hopfield Network, Hamming Network
        Unsupervised Training: Carpenter/Grossberg Classifiers
    Continuous Valued Input
        Supervised Training: Single Layer Perceptron, Multilayer Perceptron
        Unsupervised Training: Kohonen's Self Organizing Feature Map

Fig. 4.17
Q.42. What do you mean by a multilayer feed-forward neural network ?
Explain.

Ans. The backpropagation algorithm performs learning on a multilayer
feed-forward neural network. It iteratively learns a set of weights for prediction
of the class label of tuples. A multilayer feed-forward neural network consists
of an input layer, one or more hidden layers, and an output layer. An example
of a multilayer feed-forward network is shown in fig. 4.18.

Fig. 4.18 A Multilayer Feed-forward Neural Network (input layer, hidden layer,
output layer)

Each layer is made up of units. The inputs to the network correspond to
the attributes measured for each training tuple. The inputs are fed simultaneously
into the units making up the input layer. These inputs pass through the input
layer and are then weighted and fed simultaneously to a second layer of
"neuronlike" units, known as a hidden layer. The outputs of the hidden layer
units can be input to another hidden layer, and so on. The number of hidden
layers is arbitrary, although in practice, usually only one is used. The weighted
outputs of the last hidden layer are input to units making up the output layer,
which emits the network's prediction for given tuples.

The units in the input layer are called input units. The units in the hidden
layers and output layer are sometimes referred to as neurodes, due to their
symbolic biological basis, or as output units. The multilayer neural network
shown in fig. 4.18 has two layers of output units. Therefore, we say that it is a
two-layer neural network. The input layer is not counted because it only passes
the input values on to the next layer.

Similarly, a network containing two hidden layers is called a three-layer neural
network, and so on. The network is feed-forward in that none of the weights cycles
back to an input unit or to an output unit of a previous layer. It is fully connected in
that each unit provides input to each unit in the next forward layer.
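The flow of values through such a network can be sketched as a forward pass. The sketch below is illustrative only: the sigmoid activation is an assumption (the answer above does not name an activation function), and the weights, biases and the names `sigmoid` and `forward` are made up for the example.

```python
import math


def sigmoid(x):
    """Assumed activation function: squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, layers):
    """Propagate inputs through fully connected layers.

    layers is a list of (weights, biases); weights[j][i] is the weight on the
    link from unit i of the previous layer to unit j of this layer.
    """
    outputs = list(inputs)  # input units pass their values through unchanged
    for weights, biases in layers:
        outputs = [
            sigmoid(sum(w * o for w, o in zip(row, outputs)) + b)
            for row, b in zip(weights, biases)
        ]
    return outputs


# Hypothetical two-layer network: 3 input units, 2 hidden units, 1 output unit.
hidden_layer = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, -0.1])
output_layer = ([[1.2, -0.7]], [0.05])
print(forward([1.0, 0.0, 1.0], [hidden_layer, output_layer]))
```

Each unit feeds every unit of the next layer and nothing feeds backwards, which matches the "fully connected" and "feed-forward" properties described above.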
Q.43. What is back propagation ? Write the advantages of neural networks.

Ans. Backpropagation is a neural network learning algorithm. A neural
network is a set of connected input/output units in which each connection has
a weight associated with it. During the learning phase, the network learns by
adjusting the weights so as to be able to predict the correct class label of the
input tuples.

Neural networks require a number of parameters that are best determined
empirically, such as the network topology or structure. Neural networks have
been criticized for their poor interpretability. For example, it is difficult for
humans to interpret the symbolic meaning behind the learned weights and of
"hidden units" in the network. These features initially made neural networks
less desirable for data mining.

Advantages - Advantages of neural networks, however, include their
high tolerance of noisy data as well as their ability to classify patterns on
which they have not been trained. They can be used when you may have little
knowledge of the relationships between attributes and classes. They are
well-suited for continuous-valued inputs and outputs, unlike most decision tree
algorithms. They have been successful on a wide array of real-world data,
including handwritten character recognition, pathology and laboratory
medicine, and training a computer to pronounce English text. Neural network
algorithms are inherently parallel; parallelization techniques can be used to
speed up the computation process. In addition, several techniques have recently
been developed for the extraction of rules from trained neural networks. These
factors contribute toward the usefulness of neural networks for classification
and prediction in data mining.
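The learning phase, in which weights are adjusted so that the prediction moves toward the known target, can be sketched for one training tuple. This is a minimal illustration assuming sigmoid units, one hidden layer, a single output unit and the usual backpropagation update rules; the function name `train_step`, the starting weights and the learning rate are hypothetical values chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_step(x, target, w_hidden, b_hidden, w_out, b_out, rate):
    """One backpropagation update for a single training tuple.

    w_hidden[j][i]: weight from input i to hidden unit j.
    w_out[j]: weight from hidden unit j to the single output unit.
    Returns the prediction made before the update and the updated parameters.
    """
    # Propagate the inputs forward.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    # Backpropagate the errors: output unit first, then hidden units.
    err_out = out * (1 - out) * (target - out)
    err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, w_out)]
    # Update weights and biases in proportion to the learning rate.
    new_w_out = [w + rate * err_out * h for w, h in zip(w_out, hidden)]
    new_b_out = b_out + rate * err_out
    new_w_hidden = [[w + rate * e * xi for w, xi in zip(row, x)]
                    for row, e in zip(w_hidden, err_hidden)]
    new_b_hidden = [b + rate * e for b, e in zip(b_hidden, err_hidden)]
    return out, new_w_hidden, new_b_hidden, new_w_out, new_b_out


# Hypothetical starting point: 3 inputs, 2 hidden units, target class label 1.
x, target, rate = [1.0, 0.0, 1.0], 1.0, 0.9
w_hidden = [[0.2, 0.4, -0.5], [-0.3, 0.1, 0.2]]
b_hidden = [-0.4, 0.2]
w_out, b_out = [-0.3, -0.2], 0.1
for step in range(5):
    out, w_hidden, b_hidden, w_out, b_out = train_step(
        x, target, w_hidden, b_hidden, w_out, b_out, rate)
    print(step, round(out, 3))
```

Repeating the step on the same tuple moves the printed prediction toward the target, which is the behaviour the learning phase relies on.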


Q.44. Write backpropagation training algorithm and discuss it.

Ans. Backpropagation learns by iteratively processing a data set of training
tuples, comparing the network's prediction for each tuple with the actual known
target value. The target value may be the known class label of the training
tuple or a continuous value. For each training tuple, the weights are modified
so as to minimize the mean squared error between the network's prediction
and the actual target value. These modifications are made in the "backwards"
direction, that is, from the output layer, through each hidden layer down to the
first hidden layer. The weights will eventually converge, and the learning
process stops. The algorithm is given as follows:

Algorithm Backpropagation - Neural network learning for classification
or prediction, using the backpropagation algorithm.

(i) Initialize all weights and biases in network;
(ii) while terminating condition is not satisfied {
(iii) for each training tuple X in D {
(iv) // propagate the inputs forward:
(v) for each input layer unit j {
(vi) $O_j = I_j$; // output of an input unit is its actual input value
(vii) for each hidden or output layer unit j {
(viii) $I_j = \sum_i w_{ij} O_i + \theta_j$; // compute the net input of unit j with
       respect to the previous layer, i
(ix) $O_j = \frac{1}{1 + e^{-I_j}}$; } // compute the output of each unit j
(x) // backpropagate the errors:
(xi) for each unit j in the output layer
(xii) $Err_j = O_j (1 - O_j)(T_j - O_j)$; // compute the error
(xiii) for each unit j in the hidden layers, from the last to the
       first hidden layer
(xiv) $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$; // compute the error with
       respect to the next higher layer, k
(xv) for each weight $w_{ij}$ in network {
(xvi) $\Delta w_{ij} = (l)\, Err_j O_i$; // weight increment
      $w_{ij} = w_{ij} + \Delta w_{ij}$; } // weight update
(xvii) for each bias $\theta_j$ in network {
(xviii) $\Delta \theta_j = (l)\, Err_j$; // bias increment
(xix) $\theta_j = \theta_j + \Delta \theta_j$; } // bias update
(xx) }}

Here $T_j$ is the known target value of output unit j and l is the learning rate.

Q.45. Explain radial basis function network (RBFN) with its architecture.

Ans. M.J.D. Powell developed the radial basis function neural network. This
neural network can be used for approximating functions and recognizing
patterns. Gaussian potential functions and sigmoidal nonlinearities are used
by the network. Regularization networks also use Gaussian functions. For
all values of y the response of this type of function is positive. The response
decreases to zero when $|y| \to \infty$. Generally, the Gaussian is denoted by
$$f(y) = e^{-y^2}$$
On differentiating the function f(y), we get
$$f'(y) = -2y\, e^{-y^2} = -2y\, f(y)$$

If the Gaussian potential functions are used, each node produces the same
output for all inputs which lie at a fixed radial distance from the center of the
kernel. The whole network creates a linear combination of the nonlinear basis
functions. These networks have the name radial basis function networks because
they are found to be radially symmetric.

Fig. 4.19 shows the architecture of the RBF network which has three layers,
namely input, hidden and output layers. The RBF network architecture is a
feedforward network. There are n number of input neurons and m number of
output neurons with the hidden layer existing between these two layers. The
interconnection between the input layer and the hidden layer forms hypothetical
connections, and that between the hidden layer and the output layer forms
weighted connections. The training algorithm is used for updation of weights in
all the interconnections. Localized receptive field network is also known as
RBF neural network.

Fig. 4.19 Architecture of RBF

0,46. Define the architecture of perceptron. ; 

Ans, The perceptron refers to a computational model of the eye’s retina. The 
s the combination of three units, i.e. the response unit R, the association 
the sensory unit S. Fig. 4.20 represents the perceptron model. 


Adjustable 


Fixed Weights 


Response Unit 
(S) (A) (R) 


Sensory Unit Association Unit 


Fig. 4.20 Perceptron Model 


The S unit is made up of 400 photodetectors which receive input images. The S unit gives a 0/1 electric signal as output. When the input signal exceeds a threshold, the photodetector output value is 1, else 0. The photodetectors are connected randomly to the association unit A as represented in fig. 4.20. The A unit is made up of feature demons or predicates. For particular characteristics of the image, the output of the S unit is examined by the predicates. The response unit R is made up of pattern recognizers or perceptrons. The response unit R receives the predicate results, also in binary form. While the weights of the sensory unit S and association unit A are fixed, those of response unit R are adjustable.

The output of the response unit R will be 0 when the weighted sum of its input is less than or equal to 0, otherwise it is the weighted sum itself. It can also be found by a step function with bipolar values (-1/1) or binary values (0/1). Hence, in the condition of a step function yielding 0/1 output values, the output is specified as

$$y_j = f(\mathrm{net}_j) = \begin{cases} 1, & \text{if } \mathrm{net}_j > 0 \\ 0, & \text{otherwise} \end{cases} \qquad \text{where } \mathrm{net}_j = \sum_{i=1}^{n} x_i w_{ij}$$

Here, $x_i$ represents the input, $y_j$ represents the output and $w_{ij}$ represents the weight on the connection leading to the output units (R unit).

Q.47. Discuss perceptron training algorithm.

Ans. A learning procedure called the "perceptron training algorithm" can be used to obtain mechanically the weights of a perceptron that separates two classes, whenever possible. The equation of the separating hyperplane can be derived from the weights. The perceptron developed in this manner can be used to classify new samples, based on whether the node output is 0 or 1 for the new input vector.

For simplicity, we denote vectors by rows of numbers instead of columns, for ease of presentation. If $w = (w_1, w_2, \ldots, w_n)$ and $x = (x_1, x_2, \ldots, x_n)$ are two vectors, then their "dot product" or "scalar product" $w \cdot x$ is defined as $(w_1x_1 + w_2x_2 + \ldots + w_nx_n)$. The Euclidean length $\|v\|$ of a vector v is $(v \cdot v)^{1/2}$. Several researchers have reported better performance for problems by using perceptron output values $\in \{-1, 1\}$ rather than {0, 1}. Thus we assume that the perceptron with weight vector w has output 1 iff $w \cdot x > 0$, and output -1 otherwise, corresponding to the other class.

The perceptron training algorithm does not assume any a priori knowledge about the specific classification problem being solved; initial weights are random. Input samples are repeatedly presented and the performance of the perceptron observed. The weights are not changed in this step if the performance on a given input sample is satisfactory, i.e., the current network output is equal to the desired output for a given sample. However, if the network output differs, the weights must be changed in such a way as to reduce system error.

For a given iteration of the while loop, suppose w is the weight vector and i is the input vector. If the desired output for i is -1 but $w \cdot i > 0$, then the weight vector needs to be modified to $(w + \Delta w)$ so that $(w + \Delta w) \cdot i < w \cdot i$. As a result, i would have a better chance of being classified correctly in the following iteration. This can be done by selecting $\Delta w$ to be $-\eta i$, where $\eta$ is some positive constant, since

$$(w - \eta i) \cdot i = w \cdot i - \eta\,(i \cdot i) < w \cdot i$$

because $i \cdot i$ is the square of the length of the non-zero vector i and is hence always positive. In a similar way, if $w \cdot i < 0$ when the desired output value is 1, the weight vector needs to be modified so that $(w + \Delta w) \cdot i > w \cdot i$, which can be performed by selecting $\Delta w$ to be $\eta i$ for $\eta > 0$, since $(w + \eta i) \cdot i = w \cdot i + \eta\,(i \cdot i) > w \cdot i$. This is the essential rationale for the perceptron training algorithm.

Consider $\{i_1, i_2, \ldots, i_p\}$ to be the training set with p input vectors. Each input vector $i_j = (i_{j,0}, i_{j,1}, \ldots, i_{j,n})$ includes the constant component ($i_{j,0} = 1$) associated with the threshold weight. For the training set, we assume given a function class that maps each sample to either +1 or -1; class(i) = 1 if the sample i belongs to class $C_1$ and class(i) = -1 if the sample belongs to class $C_0$. To train the weights, samples are presented repeatedly. Presentations that do not change the weights can be ignored when we consider the sequence of steps at which the weights change.

The perceptron training algorithm is as follows -

   Start with a randomly chosen weight vector w_0;
   Let k = 1;
   while there exist input vectors that are misclassified by w_(k-1), do
      Let i_j be a misclassified input vector;
      Let x_k = class(i_j).i_j, implying that w_(k-1).x_k < 0;
      Update the weight vector to w_k = w_(k-1) + η x_k;
      Increment k;
   end-while;

Q.48. What do you mean by multilayer perceptrons ?

Ans. Multilayer feedforward networks are an important class of neural networks. Typically, the network consists of a set of sensory units (source nodes) that constitute the input layer, one or more hidden layers of computation nodes, and an output layer of computation nodes. The input signal propagates through the network in a forward direction, on a layer-by-layer basis. These neural networks are commonly referred to as multilayer perceptrons (MLPs), which represent a generalization of the single-layer perceptron.

Q.49. Discuss the distinctive characteristics of a multilayer perceptron.

Ans. A multilayer perceptron has three distinctive characteristics -

(i) The model of each neuron in the network includes a nonlinear activation function. The important point to emphasize here is that the nonlinearity is smooth (i.e., differentiable everywhere), as opposed to the hard-limiting used in Rosenblatt's perceptron. A commonly used form of nonlinearity that satisfies this requirement is a sigmoidal nonlinearity defined by the logistic function

$$y_j = \frac{1}{1 + \exp(-v_j)}$$

where $v_j$ is the induced local field (i.e., the weighted sum of all synaptic inputs plus the bias) of neuron j, and $y_j$ is the output of the neuron. The presence of nonlinearities is important because otherwise the input-output relation of the network could be reduced to that of a single-layer perceptron. Moreover, the use of the logistic function is biologically motivated, since it attempts to account for the refractory phase of real neurons.

(ii) The network contains one or more layers of hidden neurons that are not part of the input or output of the network. These hidden neurons enable the network to learn complex tasks by extracting progressively more meaningful features from the input patterns (vectors).

(iii) The network exhibits a high degree of connectivity, determined by the synapses of the network. A change in the connectivity of the network requires a change in the population of synaptic connections or their weights.

Q.50. What are the advantages of multilayer perceptron ? (R.G.P.V.)

Ans. Multilayer perceptron (MLP) is a development from the simple perceptron in which extra hidden layers are added. More than one hidden layer can be used. The network topology is constrained to be feedforward, i.e., loop-free. Fig. 4.21 shows the architecture of MLP.

Fig. 4.21 Multilayer Perceptron

Generally, connections are allowed from the input layer to the first hidden layer; from the first hidden layer to the second, and so on, until the last hidden layer to the output layer. The presence of these layers allows an ANN to approximate a variety of non-linear functions. The actual construction of a network, as well as the determination of the number of hidden layers and the number of units, is something of a trial-and-error process, determined by the nature of the problem. The transfer function is generally a sigmoid function.

Q.51. What do you mean by rule based algorithm ?

Ans. A classification rule, r = (a, c), consists of the if or antecedent, a, part and the then or consequent portion, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and obviously in the training data). These rules relate directly to the corresponding decision tree that could be created. A decision tree can always be used to generate rules, but they are not equivalent. The differences between rules and trees are -

(i) The tree has an implied order in which the splitting is performed. Rules have no order.

(ii) A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.

There are algorithms that generate rules from trees as well as algorithms that generate rules without first creating decision trees.

The process to generate a rule from a decision tree is straightforward. A simple algorithm will generate a rule for each leaf node in the decision tree. All rules with the same consequent could be combined together by ORing the antecedents of the simpler rules.

Algorithm -
   Input :
      X // Decision tree
   Output :
      R // Rules
   Gen Algorithm -
      // Illustrates a simple approach to generating
      // classification rules from a decision tree.
      R = ∅
      for each path from root to a leaf in X do
         a = true
         for each non-leaf node do
            a = a ∧ (label of node combined with label of incident outgoing arc)
         c = label of leaf node
         R = R ∪ r = (a, c)
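The rule-generation procedure of Q.51 can be sketched in Python as follows. This is only an illustration: the `Node` class, the representation of an antecedent as a list of "attribute = value" conjuncts, and the toy weather tree are assumptions made for the example, not part of the original algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Node:
    """Decision-tree node: an internal node tests an attribute, a leaf holds a class label."""
    label: str
    children: Dict[str, "Node"] = field(default_factory=dict)  # arc label -> child node

Rule = Tuple[List[str], str]  # (antecedent conjuncts, consequent)

def generate_rules(tree: Node) -> List[Rule]:
    """Generate one rule per root-to-leaf path."""
    rules: List[Rule] = []

    def walk(node: Node, antecedent: List[str]) -> None:
        if not node.children:                       # leaf: c = label of leaf node
            rules.append((antecedent, node.label))  # R = R U (a, c)
            return
        for arc, child in node.children.items():
            # a = a AND (label of node combined with label of outgoing arc)
            walk(child, antecedent + [f"{node.label} = {arc}"])

    walk(tree, [])
    return rules

# Hypothetical tree, used only to exercise the function.
tree = Node("outlook", {
    "sunny": Node("humidity", {"high": Node("no"), "normal": Node("yes")}),
    "overcast": Node("yes"),
    "rain": Node("no"),
})

rules = generate_rules(tree)
for antecedent, consequent in rules:
    print("IF", " AND ".join(antecedent), "THEN", consequent)
```

Rules that share a consequent (here the two "yes" rules) could afterwards be merged by ORing their antecedents, as the answer notes.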


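Returning to the perceptron training algorithm of Q.47, the update rule w_k = w_(k-1) + η x_k can be tried on a small example. The sketch below is a hedged illustration: it starts from zero weights rather than random ones, always picks the first misclassified sample, and caps the number of updates with a hypothetical `max_updates` parameter so that it stops on data that is not linearly separable. The AND training set is likewise an assumption chosen for the example.

```python
from typing import List, Sequence, Tuple

def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Dot product w.x = w1*x1 + ... + wn*xn."""
    return sum(wi * xi for wi, xi in zip(w, x))

def output(w: Sequence[float], i: Sequence[float]) -> int:
    """Perceptron output: 1 iff w.i > 0, and -1 otherwise."""
    return 1 if dot(w, i) > 0 else -1

def train_perceptron(samples: Sequence[Tuple[Sequence[float], int]],
                     eta: float = 1.0,
                     max_updates: int = 1000) -> List[float]:
    """samples holds (input vector, class) pairs with class in {+1, -1}."""
    # Prepend the constant component 1 that pairs with the threshold weight.
    data = [([1.0, *x], cls) for x, cls in samples]
    w = [0.0] * len(data[0][0])  # assumption: zeros instead of a random start
    for _ in range(max_updates):
        misclassified = [(i, cls) for i, cls in data if output(w, i) != cls]
        if not misclassified:
            return w
        i, cls = misclassified[0]
        x_k = [cls * v for v in i]                     # x_k = class(i) * i
        w = [wj + eta * xj for wj, xj in zip(w, x_k)]  # w_k = w_(k-1) + eta * x_k
    raise RuntimeError("no separating weight vector found within max_updates")

# Hypothetical training set: logical AND with bipolar class labels.
AND_SAMPLES = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), 1)]

weights = train_perceptron(AND_SAMPLES)
for x, cls in AND_SAMPLES:
    print(x, "desired", cls, "obtained", output(weights, [1.0, *x]))
```

Because the two classes of the AND problem are linearly separable, the loop ends after a finite number of weight changes; for a problem such as XOR the sketch would instead raise the error.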
Q.52. Define the term probabilistic classifiers.

Ans. Probabilistic classifiers are developed by assuming generative models which are product distributions over the original attribute space (as in naive Bayes) or more involved spaces (as in general Bayesian networks).

Probabilistic classifiers and, in particular, the naive Bayes classifier, are among the most popular classifiers, used increasingly in many applications, including natural language and visual processing.

The study of probabilistic classification is the study of approximating a joint distribution with a product distribution. Bayes rule is used to estimate the conditional probability of a class label y, and then assumptions are made on the model, to decompose this probability into a product of conditional probabilities.

$$\Pr(y \mid x) = \Pr(y \mid x^1, x^2, \ldots, x^n) = \prod_{i=1}^{n} \Pr(x^i \mid x^1, \ldots, x^{i-1}, y)\,\frac{\Pr(y)}{\Pr(x)} = \prod_{j=1}^{n'} \Pr(y^j \mid y)\,\frac{\Pr(y)}{\Pr(x)}$$

where $x = (x^1, \ldots, x^n)$ is the observation and the $y^j = g_j(x^1, \ldots, x^{i-1}, x^i)$, for some function $g_j$, are independent given the class label y.

While the use of Bayes rule is harmless, the final decomposition step introduces independence assumptions which may not hold in the data. The functions $g_j$ encode the probabilistic assumptions and allow the representation of any Bayesian network, e.g., a Markov model. The most common model used in classification, however, is the naive Bayes model in which $\forall j,\ g_j(x^1, \ldots, x^{i-1}, x^i) \equiv x^i$. That is, the original attributes are assumed to be independent given the class label.


CLUSTERING & ASSOCIATION RULE MINING

CLUSTERING AND ASSOCIATION RULE MINING - HIERARCHICAL ALGORITHMS, PARTITIONAL ALGORITHMS, CLUSTERING LARGE DATABASES - BIRCH, DBSCAN, CURE ALGORITHMS

Q.1. Write short note on cluster analysis. (R.G.P.V., June 2008, 2010)
Or
What is clustering ? What is the importance of similarity metric in clustering ? (R.G.P.V., June 2010)
Or
Write short note on clustering. (R.G.P.V., June 2013)

Ans. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity, and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis is an important human activity. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in e-commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Q.2. Discuss the importance of similarity metric in clustering. Why is it difficult to handle categorical data for clustering ? (R.G.P.V.)

Ans. Importance of Similarity Metric in Clustering - Refer to Q.1.

In categorical data, the inherent geometric properties cannot be used to define the distances between the points, and several data sets are made up of categorical attributes, on which distance functions are not naturally defined. Consider the example of the MUSHROOM data set in the UCI machine learning repository. A sample of gilled mushrooms is described by a tuple in the data set using twenty-two categorical attributes. For example, the colour attribute of cap may take values from the domain {red, pink, yellow, brown, gray, white, cinnamon}. In that case, it is really hard to determine that one colour is more than another colour in a way similar to real numbers. It is also hard to measure the dissimilarity among such attributes or to define an order.

Q.3. What are the requirements of clustering in data mining ? (R.G.P.V., 2018)

Ans. The typical requirements of clustering in data mining are discussed below -

(i) Ability to Deal with Noisy Data - Generally, most real-world databases have unknown or erroneous data, outliers, or missing data. Some clustering algorithms are sensitive to such data and may produce clusters of poor quality.

(ii) High Dimensionality - A data warehouse may have several dimensions or attributes. Low-dimensional data which have two or three dimensions are handled easily by many clustering algorithms. The tough task is to find the clusters of data objects in high-dimensional space, considering that such data can be highly skewed and sparse.

(iii) Discovery of Clusters with Arbitrary Shape - In data mining, a cluster could be of any shape. Thus, it is important to develop algorithms that can detect clusters of arbitrary shape.

(iv) Ability to Deal with Different Types of Attributes - Clustering of numerical data is easily done by many algorithms. However, there is a variety of applications which may require clustering other types of data, such as ordinal, binary, and nominal data, or mixtures of these data types.

(v) Scalability - Small data sets having fewer than several hundred data objects are easily handled by many clustering algorithms. Clustering on a sample of a given large data set may lead to biased results. Thus, there is a need for highly scalable clustering algorithms.

(vi) Usability and Interpretability - Users want the results of clustering to be usable, interpretable, and comprehensible. That is, specific semantic interpretations and applications may need to be tied to clustering.

(vii) Minimal Requirements for Domain Knowledge to Determine Input Parameters - Many clustering algorithms need users to input some parameters in cluster analysis. The clustering results are quite sensitive to input parameters. It is a difficult task to determine the parameters, especially for data sets having high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.

Q.4. Define clustering. What are the requirements of cluster analysis ? (R.G.P.V., June 2012)

Ans. Clustering - Refer to Q.1.

Requirements of Cluster Analysis - Refer to Q.3.

Q.5. What is cluster analysis ? What are some typical applications of clustering ? What are some typical requirements of clustering in data mining ? (R.G.P.V., Dec. 2011)

Ans. Cluster Analysis - Refer to Q.1.

Applications of Clustering - Cluster analysis has wide applications including market or customer segmentation, pattern recognition, biological studies, spatial data analysis, web document classification, and many others. Cluster analysis can be used as a stand-alone data mining tool to gain insight into the data distribution or can serve as a preprocessing step for other data mining algorithms operating on the detected clusters.

Requirements of Clustering - Refer to Q.3.

Q.6. List the difference between clustering and classification. (R.G.P.V., June 2016)


Ans. There are several differences between clustering and classification as follows -

   Clustering                                      Classification
   (i) Aims to identify similarities among data.   (i) Aims to verify whether data belongs to a particular class.
   [...]                                           [...] Specifies required change [...]
   [...]                                           [...] Process is more complex.

Q.7. What are the desired features of cluster analysis method ?

Ans. There are several features of cluster analysis as follows -

(i) Scalability -
   (a) Data mining problems can be large and therefore a cluster analysis method should be able to deal with large problems gracefully.
   (b) Ideally, performance should be linear with data size.
   (c) The method should also scale well to datasets in which the number of attributes is large.

(ii) Only One Scan of the Dataset -
   (a) For large problems, data must be stored on disk, so the cost of I/O from disk can become significant in solving the problem.
   (b) Therefore, the method should not require more than one scan of disk-resident data.

(iii) Ability to Stop and Resume -
   (a) For a large dataset, cluster analysis may require huge processor time to complete the task.
   (b) In such cases, the task should be able to be stopped and then resumed when convenient.

(iv) Minimal Input Parameters -
   (a) The method should not expect too much guidance from the data-mining analyst.
   (b) Therefore, the analyst should not be expected to have domain knowledge of data and to possess insight into clusters that might exist in the data.

(v) Robustness -
   (a) Most data obtained from a variety of sources has errors.
   (b) Therefore, the method should be able to deal with noise, outliers and missing values gracefully.

(vi) Ability to Discover Different Cluster-shapes -
   (a) Clusters appear in different shapes and not all clusters are spherical.
   (b) Therefore, the method should be able to discover cluster-shapes other than spherical.

(vii) Different Data Types -
   (a) Many problems have a mixture of data types, for e.g. numerical, categorical and even textual.
   (b) Therefore, the method should be able to deal with numerical, categorical and textual data.

(viii) Result Independent of Data Input Order -
   (a) The method should not be sensitive to data input-order.
   (b) Irrespective of input-order, the result of cluster analysis of the same data should be the same.

Q.8. Explain briefly the differences between "classification" and "clustering" and give an informal example of an application that would benefit from each technique. (R.G.P.V., May 2019)

Ans. Refer to Q.6 and Q.22 (Unit-I).

Q.9. Explain different data types used in clustering. (R.G.P.V., May 2019)

Ans. Datasets come in a number of different forms. The data may be quantitative, binary, nominal or ordinal.

(i) Data Matrix (Object-by-variable Structure) - It represents n objects (such as persons) with p variables (or attributes) such as age, height, weight, gender, race and so on. The structure is in the form of a relational table or n × p matrix as shown below -

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix} \quad \rightarrow \text{called a "two mode" matrix.}$$

(ii) Dissimilarity Matrix (Object-by-object Structure) - It stores a collection of proximities (closeness or distance) that are available for all pairs of n objects. It is represented by an n-by-n table as shown below -

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \quad \rightarrow \text{called a "one mode" matrix.}$$

where d(i, j) is the dissimilarity between the objects i and j; d(i, j) = d(j, i) and d(i, i) = 0.

Many clustering algorithms use a dissimilarity matrix. So data represented using a data matrix are converted into a dissimilarity matrix before applying such clustering algorithms. Clustering of objects is done based on their similarities or dissimilarities. Similarity coefficients or dissimilarity coefficients are derived from correlation coefficients.
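The conversion from a data matrix to a dissimilarity matrix can be sketched as below. The text does not fix a particular measure, so Euclidean distance is used here purely as an assumption; the three sample objects are also made up for the illustration.

```python
import math
from typing import List, Sequence

def dissimilarity_matrix(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build the n-by-n matrix d with d[i][j] = d[j][i] and d[i][i] = 0."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # only the lower triangle needs computing
            dist = math.dist(data[i], data[j])  # assumption: Euclidean distance
            d[i][j] = d[j][i] = dist
    return d

# Hypothetical data matrix: n = 3 objects, p = 2 numeric variables.
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

d = dissimilarity_matrix(data_matrix)
for row in d:
    print(row)
```

Only the lower triangle is computed because the matrix is symmetric, which mirrors the triangular layout shown in the answer.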

(iii) Quantitative (or Numerical) Data - It is quite common, for example, weight, marks, height, price, salary, and count. There are a number of methods for computing similarity between quantitative data.

(iv) Binary Data - It is also quite common, for example, gender and marital status. As we have noted earlier, computing similarity or distance between categorical variables is not as simple as for quantitative data, but a number of methods have been proposed.

(v) Qualitative Nominal Data - It is similar to binary data, except that it may take more than two values but has no natural order, for example religion, foods or colors.

(vi) Qualitative Ordinal (or Ranked) Data - It is similar to nominal data except that the data has an order associated with it, for example, grades A, B, C, D, sizes S, M, L and XL. The problem of measuring distance between ordinal variables is different than for nominal variables since the order of the values is important. One method of computing distance involves transferring the values to numeric values according to their rank. For example, grades A, B, C, D could be transformed to 4.0, 3.0, 2.0 and 1.0 and then one of the methods in the next section may be used.
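The rank transformation for ordinal data can be illustrated with the grade example from the text. In this sketch the absolute difference of ranks is an assumed distance; the text leaves the choice of method to the next section.

```python
from typing import Dict

# Rank mapping taken from the text: A, B, C, D -> 4.0, 3.0, 2.0, 1.0
GRADE_RANK: Dict[str, float] = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

def ordinal_distance(a: str, b: str, ranks: Dict[str, float] = GRADE_RANK) -> float:
    """Distance between two ordinal values after mapping them to their ranks.

    Assumption: absolute difference of ranks is used as the distance.
    """
    return abs(ranks[a] - ranks[b])

print(ordinal_distance("A", "B"))
print(ordinal_distance("A", "D"))
```

Unlike nominal values, the order matters here: grade A ends up closer to B than to D.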


Q.10. What do you mean by association rule mining ? Give an example of market basket analysis from the real world. (R.G.P.V., June 2015)

Ans. Association rule mining consists of first finding frequent itemsets from which strong association rules in the form X → Y are generated, where X and Y are items or itemsets, X being the rule antecedent and Y being the rule consequent. In other words, a rule antecedent is the portion of a rule that needs to be satisfied in order that the rule consequent is true. Strong rules satisfy a minimum confidence threshold. Association rule mining is a data mining technique looking for connections between the records belonging to a large dataset.

Market basket analysis examines how likely an item is to be purchased if the customer has bought the other item that is identified as having an association with the first item, compared to the likelihood of it being purchased without the other item being purchased. Using bar code scanners, a supermarket database consists of a large number of transaction records, listing all items bought by a customer on a single purchase transaction. The resulting associations can be used for catalog design and for identifying the customer segments based on their purchase pattern.


Q.11. Explain association rule in mathematical notations. Define support and confidence in association rule mining. (R.G.P.V., June 2012)
Or
Discuss the concept of frequent sets, confidence and support.

Ans. Assume $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items and D be a set of database transactions in which each transaction T is a set of items such that $T \subseteq I$. Each transaction is related with an identifier known as TID. Suppose A is a set of items. A transaction T contains A if and only if $A \subseteq T$. An association rule is an expression of the form $A \Rightarrow B$, where A and B are subsets of I, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ is true in the transaction set D with support s, where s is the percentage of transactions in D that contain $A \cup B$. This is taken to be the probability, $P(A \cup B)$. That is,

$$\text{support}(A \Rightarrow B) = P(A \cup B)$$

The rule $A \Rightarrow B$ has confidence c in the transaction set D, if c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability, $P(B \mid A)$. That is,

$$\text{confidence}(A \Rightarrow B) = P(B \mid A)$$


Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, we write support and confidence values so as to occur between 0% and 100%, rather than 0 to 1.0.

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known as the frequency, support_count, or count of the itemset.

An association with a very high support and confidence is a pattern that occurs so often in the database that it should be obvious to the end user. Patterns with extremely low support and confidence should be regarded as of no significance. Only patterns with a combination of intermediate values of confidence and support provide the user with interesting and previously unknown information.
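The definitions of support count, support, confidence and strong rules can be worked through on a small transaction set. The sketch below is illustrative only: the five transactions and the threshold values are assumptions, and support and confidence are returned as fractions between 0 and 1 (multiply by 100 for the percentage convention).

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]

def support_count(itemset: Itemset, transactions: List[Itemset]) -> int:
    """Number of transactions T that contain the itemset (itemset is a subset of T)."""
    return sum(1 for t in transactions if itemset <= t)

def support(a: Itemset, b: Itemset, transactions: List[Itemset]) -> float:
    """support(A => B) = P(A U B): fraction of transactions containing both A and B."""
    return support_count(a | b, transactions) / len(transactions)

def confidence(a: Itemset, b: Itemset, transactions: List[Itemset]) -> float:
    """confidence(A => B) = P(B | A) = support_count(A U B) / support_count(A)."""
    return support_count(a | b, transactions) / support_count(a, transactions)

def is_strong(a: Itemset, b: Itemset, transactions: List[Itemset],
              min_sup: float, min_conf: float) -> bool:
    """A rule is strong if it meets both min_sup and min_conf."""
    return (support(a, b, transactions) >= min_sup
            and confidence(a, b, transactions) >= min_conf)

# Hypothetical transaction database D.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

A, B = frozenset({"bread"}), frozenset({"milk"})
print("support    =", support(A, B, D))
print("confidence =", confidence(A, B, D))
print("strong     =", is_strong(A, B, D, min_sup=0.5, min_conf=0.7))
```

In this data, three of the five transactions contain both bread and milk, and four contain bread, so the rule bread ⇒ milk has 60% support and 75% confidence.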


Q.12. Define support and confidence in association rule mining. (R.G.P.V., June 2016)

Ans. Refer to Q.11.


Q.13. Discuss the importance of discovering association rules. (R.G.P.V., Dec. 2011)

Ans. There are several areas of data mining. Among them, the problem of discovering associations from data has received a great deal of attention. This problem is often referred to as the market-basket problem. Given a set of items and a large collection of transactions, which are subsets (baskets) of these items, the task is to find relationships between the presences of different items within these baskets.

This framework can be fitted to various applications of data mining. The most popular example of this is the supermarket. In this context, the problem is to analyze the buying habits of customers by deriving associations between the numerous items that customers place in their shopping baskets. The discovery of such association rules is very helpful for the retailer to develop marketing strategies, by gaining insight into matters such as "which items are most frequently purchased by customers". It is also helpful in sale promotion strategies, inventory management, etc.


Q.14. Define single-dimensional and boolean association rules.

Ans. If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. If a rule references two or more dimensions, then it is a multidimensional association rule. The following is an example of a multidimensional rule -

   age(X, "20...39") ∧ income(X, "42K...48K") ⇒ buys(X, "high resolution TV")

If a rule involves associations between the presence or absence of items, it is a boolean association rule. For example,

   buys(X, "laptop_computer") ⇒ buys(X, "HP printer")
Q.15. What do you mean by multilevel association rules ? Discuss different variations to this approach. (R.G.P.V., Dec. 2008, 2011)
Or
Explain the multilevel association rules.
Or
Write a short note on multilevel association rules. (R.G.P.V., Nov. 2019)

Ans. Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at the concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations. There are a number of variations to this approach, where each variation involves "playing" with the support threshold in a slightly different way. Some are described below -


(i) Uniform Support - The same minimum support threshold is used when mining at each level of abstraction. In fig. 5.1, a minimum support threshold of 5% is used throughout. Both "computer" and "laptop computer" are found to be frequent, while "desktop computer" is not.

   Level 1 (min_sup = 5%) : computer [support = 10%]
   Level 2 (min_sup = 5%) : laptop computer [support = 6%], desktop computer [support = 4%]

Fig. 5.1 Multilevel Mining with Uniform Support
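The top-down search of fig. 5.1 can be sketched in a few lines. The support values are the ones shown in the figure; the dictionary layout, the function name and the per-level threshold table are assumptions made for the illustration. Keeping one threshold per level lets the same function express a uniform threshold as well as level-specific ones.

```python
from typing import Dict, List

# Support values from fig. 5.1 (as fractions) and the concept hierarchy.
SUPPORT: Dict[str, float] = {
    "computer": 0.10,
    "laptop computer": 0.06,
    "desktop computer": 0.04,
}
CHILDREN: Dict[str, List[str]] = {
    "computer": ["laptop computer", "desktop computer"],
}

def frequent_items(item: str, min_sup_by_level: Dict[int, float], level: int = 1) -> List[str]:
    """Top-down search: descendants are examined only if their ancestor is frequent."""
    if SUPPORT[item] < min_sup_by_level[level]:
        return []  # prune this item and everything below it
    found = [item]
    for child in CHILDREN.get(item, []):
        found += frequent_items(child, min_sup_by_level, level + 1)
    return found

uniform = {1: 0.05, 2: 0.05}  # the same 5% threshold at both levels
print(frequent_items("computer", uniform))
```

With the uniform 5% threshold, "desktop computer" (4%) is not frequent; a lower threshold at level 2, for instance 3%, would admit it as well.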
When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold. An Apriori-like optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants. The search avoids examining itemsets containing any item whose ancestors do not have minimum support.

The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels.

(ii) Reduced Support - Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold is. For example, in fig. 5.2, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, "computer", "laptop computer" and "desktop computer" are all considered frequent.

   Level 1 (min_sup = 5%) : computer [support = 10%]
   Level 2 (min_sup = 3%) : laptop computer [support = 6%], desktop computer [support = 4%]

Fig. 5.2 Multilevel Mining with Reduced Support

(iii) Group-based Support - Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item, or group-based minimal support thresholds when mining multilevel rules. For example, a user could set up the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories.

Q.16. Discuss mining of multilevel association rules and explain how to check redundant multilevel association rules. (R.G.P.V., Nov./Dec. 2007, 2009, June 2014)

Ans. Mining of Multilevel Association Rules - Refer to Q.15.

Checking of Redundant Multilevel Association Rules - A serious side effect of mining multilevel association rules is its generation of many redundant rules across multiple levels of abstraction due to the "ancestor" relationships among items. For example, consider the following rules where "laptop computer" is an ancestor of "IBM laptop computer", and where X is a variable representing customers who purchased items in AllElectronics transactions.

   buys(X, "laptop computer") ⇒ buys(X, "HP printer")   [support = 8%, confidence = 70%]   ...(I)

   buys(X, "IBM laptop computer") ⇒ buys(X, "HP printer")   [support = 2%, confidence = 72%]   ...(II)

A rule R1 is an ancestor of a rule R2, if R1 can be obtained by replacing the items in R2 by their ancestors in a concept hierarchy. For example, rule I is an ancestor of rule II because "laptop computer" is an ancestor of "IBM laptop computer". Based on this definition, a rule can be considered redundant if its support and confidence are close to their "expected" values, based on an ancestor of the rule. Suppose that rule I has a 70% confidence and 8% support, and that about one-quarter of all "laptop computer" sales are for "IBM laptop computers". We may expect rule II to have a confidence of around 70% (since all data samples of "IBM laptop computer" are also samples of "laptop computer") and a support of around 2% (i.e. 8% × 1/4). If this is indeed the case, then rule II is not interesting because it does not offer any additional information and is less general than rule I.

Q.17. What are multidimensional association rules ? Mention few approaches to mining multilevel association rule. (R.G.P.V., June 2012)
Or
Discuss the detailed concept of mining multidimensional association rules from relational databases. (R.G.P.V., June 2013)

Ans. Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules. For instance, the rule

   age(X, "20...29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")

contains three predicates (age, occupation, and buys), each of which occurs only once in the rule. Hence, we say that it has no repeated predicates. Multidimensional association rules with no repeated predicates are called interdimensional association rules. We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An example of such a rule is the following, where the predicate buys is repeated -

   age(X, "20...29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer")
Database attributes can be categorical or quantitative. Categorical attributes
have a finite number of possible values, with no ordering among the values.
Quantitative attributes are numeric and have an implicit ordering among values.
Techniques for mining multidimensional association rules can be categorized
into two basic approaches regarding the treatment of quantitative attributes.

(i) Static Discretization of Quantitative Attributes — In the first
approach, quantitative attributes are discretized using predefined concept
hierarchies. This discretization occurs before mining. For instance, a concept
hierarchy for income may be used to replace the original numeric values of
this attribute by interval labels, such as “0...20K”, “21K...30K”,
“31K...40K”, and so on. Here, discretization is static and predetermined.
The discretized numeric attributes, with their interval labels, can then be treated
as categorical attributes (where each interval is considered a category). We
refer to this as mining multidimensional association rules using static
discretization of quantitative attributes.

(ii) Dynamic Quantitative Association Rules — In the second
approach, quantitative attributes are discretized or clustered into “bins” based
on the distribution of the data. These bins may be further combined during the
mining process. The discretization process is dynamic and established so as to
satisfy some mining criteria, such as maximizing the confidence of the rules
mined. Because this strategy treats the numeric attribute values as quantities
rather than as predefined ranges or categories, association rules mined from
this approach are also referred to as (dynamic) quantitative association rules.


Q.18. Explain mining multidimensional Boolean association rules from
transaction. (R.GP.V., May 2019)

Ans. We have studied association rules that imply a single predicate, that
is, the predicate buys. For instance, in mining our ABC company database, we
may discover the Boolean association rule “IBM desktop computer” implies
“Sony b/w printer”, which can also be written as

buys(X, “IBM desktop computer”) ⇒ buys(X, “Sony b/w printer”)

where X is a variable representing customers who purchased items in ABC
company transactions. Following the terminology used in multidimensional
databases, we refer to each distinct predicate in a rule as a dimension. Hence,
we can refer to the above rule as a single-dimensional or intradimensional
association rule since it contains a single distinct predicate (e.g., buys) with
multiple occurrences (i.e., the predicate occurs more than once within the rule).
Such rules are commonly mined from transactional data. Suppose, however,
that rather than using a transactional database, sales and related information
is stored in a relational database or data warehouse. Such data stores are
multidimensional, by definition. For instance, in addition to keeping track of
the items purchased in sales transactions, a relational database may record
other attributes associated with the items, such as the quantity purchased or
the price, or the branch location of the sale. Additional relational information
regarding the customers who purchased the items, such as customer age,
income, and address, may also be stored. Considering each database attribute
or warehouse dimension as a predicate, it can therefore be interesting to mine
association rules containing multiple predicates, such as

age(X, “20...29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”)

Association rules that involve two or more dimensions or predicates can be
referred to as multidimensional association rules.


Quantitative association rules are multidimensional association rules
in which the numeric attributes are dynamically discretized. Consider rules of
the form

A_quan1 ∧ A_quan2 ⇒ A_cat

where A_quan1 and A_quan2 are tests on quantitative attribute intervals, and A_cat
tests a categorical attribute from the task-relevant data. Such rules have been
referred to as two-dimensional quantitative association rules.

Let's look at an approach used in a system called ARCS (Association
Rule Clustering System). This approach maps pairs of quantitative attributes
onto a 2-D grid for tuples. The grid is then searched for clusters of points from
which the association rules are generated. The following steps are involved in
ARCS.

Binning — Just think about how big a 2-D grid would be if we plotted age
and income as axes, where each possible value of age was assigned a unique
position on one axis, and similarly, each possible value of income was assigned
a unique position on the other axis! To keep grids down to a manageable size,
we instead partition the ranges of quantitative attributes into intervals. These
intervals are dynamic in that they may later be further combined during the
mining process. The partitioning process is referred to as binning.

Three common binning strategies are as follows —

(i) Equal-width Binning — Where the interval size of each bin is the
same.

(ii) Equal-frequency Binning — Where each bin has approximately
the same number of tuples assigned to it.

(iii) Clustering-based Binning — Where clustering is performed on
the quantitative attribute to group neighbouring points into the same bin.
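
The first two binning strategies can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `equal_width_bins` and `equal_frequency_bins`, the sample ages, and the choice of four bins are hypothetical and do not come from ARCS itself.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign values to bins that hold roughly the same number of tuples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


ages = [21, 22, 23, 25, 30, 35, 41, 60]  # hypothetical sample data
width_bins = equal_width_bins(ages, 4)
freq_bins = equal_frequency_bins(ages, 4)
```

With skewed data such as these ages, equal-width binning places most tuples in the first interval, while equal-frequency binning spreads them evenly. Clustering-based binning would replace the fixed boundaries with the output of any one-dimensional clustering routine.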
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Finding Frequent Predicate Sets 
Once the 2-D array containing the count distribution for each category is set
up, it can be scanned to find the frequent predicate sets that also satisfy
minimum confidence. Strong association rules can then be generated from these
predicate sets, using a rule generation algorithm.


Clustering the Association Rules — An example of a 2-D quantitative
association rule is

age(X, “30...39”) ∧ income(X, “42K...48K”) ⇒ buys(X, “HDTV”)

Fig. 5.3 shows a 2-D grid for 2-D quantitative association rules predicting the
condition buys(X, “HDTV”) on the rule right-hand side, given the quantitative
attributes age and income. The four Xs correspond to the rules

age(X, 34) ∧ income(X, “31K...40K”) ⇒ buys(X, “HDTV”)
age(X, 35) ∧ income(X, “31K...40K”) ⇒ buys(X, “HDTV”)
age(X, 34) ∧ income(X, “41K...50K”) ⇒ buys(X, “HDTV”)
age(X, 35) ∧ income(X, “41K...50K”) ⇒ buys(X, “HDTV”)

These rules are quite “close” to one another, forming a rule cluster on the
grid. The four rules can be combined or “clustered” together to form the
following simple rule, which subsumes and replaces the above four rules —

age(X, “34...35”) ∧ income(X, “31K...50K”) ⇒ buys(X, “HDTV”)

ARCS employs a clustering algorithm for this purpose. The algorithm
scans the grid, searching for rectangular clusters of rules. In this way, bins of
the quantitative attributes occurring within a rule cluster may be further
combined, and hence further dynamic discretization of the quantitative attributes
occurs.


Fig. 5.3 A 2-D Grid for Tuples Representing Customers
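
The merging of the four HDTV rules into one rectangular rule can be sketched in Python as follows. This is a simplified illustration, not the actual ARCS clustering algorithm: the function name `merge_rule_cluster` and the representation of a grid cell as `(age, (income_low, income_high))` are assumptions made for the example.

```python
def merge_rule_cluster(cells):
    """Merge grid cells (age, (income_lo, income_hi)) into one rectangular rule.

    Returns ((age_lo, age_hi), (income_lo, income_hi)), or None when the cells
    do not completely fill their bounding rectangle.
    """
    ages = sorted({age for age, _ in cells})
    bins = sorted({income for _, income in cells})
    # Ages must be consecutive and every (age, bin) combination must be present.
    # Adjacency of the income bins themselves is assumed, not checked.
    consecutive = all(b - a == 1 for a, b in zip(ages, ages[1:]))
    complete = {(a, b) for a in ages for b in bins} == set(cells)
    if not (consecutive and complete):
        return None
    return (ages[0], ages[-1]), (bins[0][0], bins[-1][1])


# The four cells marked X in the grid, with income given in thousands.
cells = [(34, (31, 40)), (35, (31, 40)), (34, (41, 50)), (35, (41, 50))]
merged = merge_rule_cluster(cells)
```

For the four cells above the sketch yields the age range 34 to 35 and the income range 31K to 50K, which corresponds to the combined rule in the text. If one of the four cells is missing, the cells no longer form a full rectangle and no merged rule is produced.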


Q.20. What are the methods of association rule mining ? Explain them.
(R.GP.V., Dec. 2003)
Or
Explain in brief that how we can mine different kind of association
rules. (R.GP.V., June 2013)

Ans. There are various kinds of association rules for additional application
requirements. Extending our scope to include mining multilevel association
rules, multidimensional association rules and quantitative association rules in
relational databases and data warehouses, the methods are as follows —

(i) Mining Multilevel Association Rules — Refer to Q.15.

(ii) Mining Multidimensional Association Rules — Refer to Q.17.

(iii) Mining Quantitative Association Rules — Refer to Q.19.


Q.21. Explain Naive algorithm in brief for generating frequent itemsets.

Ans. The simplest method to calculate frequent itemsets is to consider all
possible itemsets, calculate their support, and check whether they are higher
than the minimum support threshold. Given that 2^m itemsets must be searched
for m items, and n transactions must be scanned each time, this algorithm needs
O(2^m n) tests. This number increases exponentially with the number of items.
Hence, much time is taken by calculation for larger problems. Since the factor
2^m is due to the number of itemsets tested, we need to find a method to reduce
the number of tests. The itemsets that we can assume safely will not give
frequent itemsets do not require to be tested. This reasoning resulted in the
development of the apriori algorithm.

A Naive algorithm for generation of frequent itemsets is given below —


Here D is the set of transactions, I is the set of all items, and i counts the
transactions that contain the subset s.

Step (i)     START
Step (ii)    n = |D|
Step (iii)   for each subset s of I do
Step (iv)        i = 0
Step (v)         for each transaction T in D do
Step (vi)            if s is a subset of T then
Step (vii)               i = i + 1
Step (viii)      if minimum support <= i/n then
Step (ix)            add s to frequent subsets
Step (x)     END
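
The naive algorithm above can be written as a short Python sketch. The function name `naive_frequent_itemsets` and the sample transactions are hypothetical, chosen only for illustration. The empty subset is skipped because it is trivially contained in every transaction.

```python
from itertools import combinations


def naive_frequent_itemsets(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    # Enumerate every non-empty subset of the item set (2^m - 1 candidates).
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            s = frozenset(subset)
            # Scan all n transactions and count those containing s.
            count = sum(1 for t in transactions if s <= t)
            if count / n >= min_support:
                frequent[s] = count / n
    return frequent


transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
result = naive_frequent_itemsets(transactions, min_support=0.5)
```

The two nested loops make the cost visible: every one of the exponentially many candidate subsets triggers a full scan of the transactions, which is the behaviour the apriori algorithm was designed to avoid.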


Q.22. What are the different categories of clustering methods ?
(R.GP.V., June 2009, 2014)
Or
Compare and discuss the various approaches to clustering.
(R.GP.V., June 2011)
Or
Explain the major classification of clustering methods.
(R.GP.V., May 2018)
Or
Discuss categorization of clustering methods. (R.GP.V., Nov. 2019)



Ans. The major clustering methods can be classified into the following
categories —

(i) Partitioning Methods — Given a database of n objects or data
tuples, a partitioning method constructs k partitions of the data, where each
partition represents a cluster and k ≤ n. That is, it classifies the data into k
groups, which together satisfy the following requirements —

(a) Each group must contain at least one object.

(b) Each object must belong to exactly one group.

The second requirement can be relaxed in some fuzzy partitioning techniques.

Given k, the number of partitions to construct, a partitioning method creates
an initial partitioning. It then uses an iterative relocation technique that
attempts to improve the quality of partitions by moving objects from one group
to another.

To achieve global optimality in partitioning-based clustering would require
the exhaustive enumeration of all of the possible partitions. Instead, most
applications adopt one of a few popular heuristic methods, such as —

(a) The k-means algorithm, where each cluster is represented
by the mean value of the objects in the cluster.

(b) The k-medoids algorithm, where each cluster is represented
by one of the objects located near the center of the cluster.


(ii) Hierarchical Methods — A hierarchical method creates a
hierarchical decomposition of the given set of data objects. A hierarchical
method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed. The agglomerative approach,
also called the bottom-up approach, starts with each object forming a separate
group. It successively merges the objects or groups that are close to one another,
until all of the groups are merged into one, or until a termination condition
holds. The divisive approach, also called the top-down approach, starts with
all of the objects in the same cluster. In each successive iteration, a cluster is
split up into smaller clusters, until eventually each object is in one cluster, or
until a termination condition holds.

In hierarchical methods, once a step is done, it can never be undone. Due
to this, it leads to smaller computation costs by not having to worry about a
combinatorial number of different choices. However, such techniques cannot
correct erroneous decisions. There are two approaches for improving the quality
of hierarchical clustering: (a) perform careful analysis of object “linkages” at
each hierarchical partitioning, as in Chameleon, or (b) integrate hierarchical
agglomeration and other approaches by first using a hierarchical agglomerative
algorithm to group objects into microclusters, and then performing
macroclustering on the microclusters using another clustering method such as
iterative relocation, as in BIRCH.
(iii) Density-based Methods — Most partitioning methods cluster
objects based on the distance between objects. Such methods can find only
spherical-shaped clusters and encounter difficulty at discovering clusters of
arbitrary shapes. Other clustering methods have been developed based on the
notion of density. Their general idea is to continue growing the given cluster
as long as the density in the “neighbourhood” exceeds some threshold; that is,
for each data point within a given cluster, the neighbourhood of a given radius
has to contain at least a minimum number of points. Such a method can be
used to filter out noise and discover clusters of arbitrary shape.
DBSCAN and its extension, OPTICS, are density based methods that
grow clusters according to a density based connectivity analysis. DENCLUE
is a method that clusters objects based on the analysis of the value distributions
of density functions.
(iv) Grid-based Methods — Grid-based methods quantize the object
space into a finite number of cells that form a grid structure. All of the clustering
operations are performed on the grid structure. The main advantage of this
approach is its fast processing time, which is typically independent of the
number of data objects and dependent only on the number of cells in each
dimension in the quantized space. STING is a typical example of a grid-based
method.
(v) Model-based Methods — Model-based methods hypothesize a
model for each of the clusters and find the best fit of the data to the given
model. A model-based algorithm may locate clusters by constructing a density
function that reflects the spatial distribution of the data points. It also leads to
a way of automatically determining the number of clusters based on standard
statistics, taking “noise” or outliers into account and thus yielding robust
clustering methods. EM is an algorithm that performs expectation-maximization
analysis based on statistical modeling. COBWEB is a conceptual learning
algorithm that performs probability analysis and takes concepts as a model for
clusters. SOM is a neural network-based algorithm that clusters by mapping
high-dimensional data into a 2-D or 3-D feature map, which is also useful for
data visualization.

(vi) Clustering High-dimensional Data — It is a particularly
important task in cluster analysis because many applications require the analysis
of objects containing a large number of features or dimensions. Clustering
high-dimensional data is challenging due to the curse of dimensionality. Many
dimensions may not be relevant. As the number of dimensions increases, the
data become increasingly sparse so that the distance measurement between
pairs of points becomes meaningless and the average density of points anywhere
in the data is likely to be low. Therefore, a different clustering methodology
needs to be developed for high-dimensional data. CLIQUE and PROCLUS are
two influential subspace clustering methods, which search for clusters in
subspaces of the data, rather than over the entire data space. Frequent
pattern-based clustering, another clustering methodology, extracts distinct
frequent patterns among subsets of dimensions that occur frequently. It uses
such patterns to group objects and generate meaningful clusters.

(vii) Constraint-based Clustering — This clustering approach
performs clustering by incorporating user-specified or application-oriented
constraints. A constraint expresses a user's expectation or describes properties
of the desired clustering results, and provides an effective means for
communicating with the clustering process. Various kinds of constraints can
be specified, either by a user or as per application requirements.


Q.23. Write short note on partitioning methods. (R.GP.V., June 2014)
Ans. Refer to Q.22 (i).


Q.24. Discuss about model based clustering methods.
(R.GP.V., Dec. 2011)
Ans. Refer to Q.22 (v).


Q.25. What is clustering ? Briefly describe the Partitioning and 
hierarchical clustering methods. (R.GP.V., June 2016) 


Ans. Refer to Q.1 and Q.22. 


Q.26. What is hierarchical clustering ? Differentiate agglomerative and
divisive hierarchical clustering. (R.GP.V., June 2015)

Ans. Hierarchical Clustering — Refer to Q.22 (ii).

Agglomerative hierarchical clustering is a bottom-up strategy that begins by
placing each object in its own cluster and then merges these atomic clusters
into larger and larger clusters, until all of the objects are in a single cluster or
until certain termination conditions are satisfied. Most hierarchical clustering
methods belong to this category. They differ only in their definition of
intercluster similarity.

Divisive hierarchical clustering is a top-down strategy that does the reverse
of agglomerative hierarchical clustering by starting with all objects in one
cluster. It subdivides the cluster into smaller and smaller pieces, until each
object forms a cluster on its own or until it satisfies certain termination
conditions, such as a desired number of clusters is obtained or the diameter of
each cluster is within a certain threshold.
A fear 


hierarchical clustering and partition 
(R.GP.V., Dec. 2002) 


Ans. Partition clustering techniques partition the database into a
predefined number of clusters. They attempt to determine k partitions that
optimize a certain criterion function. The partition clustering algorithms are
of two types: k-means algorithms and k-medoid algorithms. Another type of
algorithm can be k-mode algorithms.

Hierarchical clustering techniques do a sequence of partitions, in
which each partition is nested into the next partition in the sequence. It creates
a hierarchy of clusters from small to big or big to small. The hierarchical
techniques are of two types: agglomerative and divisive clustering techniques.
Agglomerative clustering techniques start with as many clusters as there are
records, with each cluster having only one record. Then pairs of clusters are
successively merged until the number of clusters reduces to k. At each stage,
the pairs of clusters that are merged are the ones nearest to each other. If
the merging is continued, it terminates in a hierarchy of clusters which is built
with just a single cluster containing all the records, at the top of the hierarchy.
Divisive clustering techniques take the opposite approach from agglomerative
techniques. They start with all the records in one cluster, and then try to split
that cluster into small pieces.
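
The agglomerative procedure described above (start with one cluster per record, repeatedly merge the nearest pair until k clusters remain) can be sketched in Python. The text does not say how the distance between two clusters is measured, so this sketch assumes single-link distance on one-dimensional records. The function name `agglomerative` and the sample records are illustrative only.

```python
def agglomerative(records, k):
    """Merge the two nearest clusters until only k clusters remain."""
    clusters = [[r] for r in records]  # start: one cluster per record

    def distance(a, b):
        # Assumed single-link distance: gap between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


clusters = agglomerative([2, 4, 10, 12, 3, 20, 30, 11, 25], k=3)
```

Letting the loop run down to k = 1 would produce the full hierarchy with a single cluster at the top. A divisive method would run in the opposite direction, splitting instead of merging.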


Q.28. Write short note on the following —
(i) k-clusters
(ii) Intra-attribute summary
(iii) Cluster projection.
(R.GP.V., Dec. 2010, June 2014)
Or
Define the following —
(i) Cluster-projection
(ii) Intra-attribute summary.
(R.GP.V., Dec. 2004)

Ans. (i) k-clusters — A cluster on k attributes is called a k-cluster. It is
a subspace cluster.


(ii) Intra-attribute Summary — Let A_1, ......, A_n be a set of categorical
attributes with domains D_1, ......, D_n, respectively, and let D be a dataset. The
intra-attribute summary Σ_II is defined as —

$$\Sigma_{II} = \{\Sigma_{ii}^{j} \mid i, j \in \{1, \ldots, n\} \text{ and } i \neq j\}$$

where

$$\Sigma_{ii}^{j} = \{(a_{i1}, a_{i2}, \gamma^{j}(a_{i1}, a_{i2})) \mid a_{i1}, a_{i2} \in D_i \text{ and } \gamma^{j}(a_{i1}, a_{i2}) > 0\}$$

(iii) Cluster-projection — A cluster on a database that is a projection
of the original database D is called a cluster projection, with respect to the
attributes present in the projection operation. C_i is called the cluster projection
of C on attribute A_i.


Q.29. Given data items {2, 4, 10, 12, 3, 20, 30, 11, 25}, trace the execution
of K-means clustering algorithm for the same. (R.GP.V., June 2013)

Ans. Given data items {2, 4, 10, 12, 3, 20, 30, 11, 25}.
Let the value of K be 2.
Suppose initial values for means are 2 and 4, then the clusters are K1 =
{2, 3} and K2 = {4, 10, 12, 20, 30, 11, 25}. The value 3 is equally close to both
means, so we arbitrarily choose K1. Proceeding in such a way, we get the
following —

m1      m2      K1                        K2
2.5     16      {2, 3, 4}                 {10, 12, 20, 30, 11, 25}
3       18      {2, 3, 4, 10}             {12, 20, 30, 11, 25}
4.75    19.6    {2, 3, 4, 10, 12, 11}     {20, 30, 25}
7       25      {2, 3, 4, 10, 12, 11}     {20, 30, 25}

The clusters in the last two steps are identical, thus our final clusters are
K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 25, 30}.


Q.30. How efficient is the k-medoids algorithm on large data set ?

Ans. To deal with larger data sets, a sampling-based method, called
CLARA (Clustering LARge Applications), can be used. The basic idea of
CLARA is as follows —

Instead of taking the whole set of data into consideration, a small portion
of the actual data is chosen as a representative of the data. Medoids are then
chosen from this sample using PAM. If the sample is selected in a fairly random
manner, it should closely represent the original data set. The representative
objects (medoids) chosen will likely be similar to those that would have been
chosen from the whole data set. CLARA draws multiple samples of the data
set, applies PAM on each sample and returns its best clustering as the output.
As expected, CLARA can deal with larger data sets than PAM. The complexity
of each iteration now becomes O(ks² + k(n − k)), where s is the size of the
sample, k is the number of clusters and n is the total number of objects.

The effectiveness of CLARA depends on the sample size. CLARA
searches for the best k-medoids among the selected sample of the data set.
CLARA cannot find the best clustering if any of the best sampled medoids is
not among the best k-medoids. That is, if an object o_i is one of the best
k-medoids but is not selected during sampling, CLARA will never find the best
clustering. This is, therefore, a trade-off for efficiency. A good clustering based
on sampling will not necessarily represent a good clustering of the whole data
set if the sample is biased.


Q.31. Explain k-means partitioning method.
Or
Explain k-means algorithm briefly. (R.GP.V., June 2011)

Ans. The k-means algorithm takes the input parameter, k, and partitions a
set of n objects into k clusters so that the resulting intracluster similarity is
high but the intercluster similarity is low. Cluster similarity is measured in
regard to the mean value of the objects in a cluster, which can be viewed as the
cluster's centroid or centre of gravity.

Working of the k-means Algorithm — The k-means algorithm proceeds
as follows. First, it randomly selects k of the objects, each of which initially
represents a cluster mean or centre. For each of the remaining objects, an
object is assigned to the cluster to which it is the most similar, based on the
distance between the object and the cluster mean. It then computes the new
mean for each cluster. This process iterates until the criterion function
converges. Typically, the square-error criterion is used, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

where E is the sum of the square error for all objects in the data set; p is the
point in space representing a given object; and m_i is the mean of cluster C_i.
In other words, for each object in each cluster, the distance from the object
to its cluster center is squared, and the distances are summed. This criterion
tries to make the resulting k clusters as compact and as separate as possible.
The k-means procedure is summarized in fig. 5.4.


Algorithm — k-means. The k-means algorithm for partitioning, where each
cluster's center is represented by the mean value of the objects in the cluster.
Input —  k : the number of clusters,
         D : a data set containing n objects.
Output — A set of k clusters.

Method —
(i) arbitrarily choose k objects from D as the initial cluster centers;
(ii) repeat
(iii) (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
(iv) update the cluster means, i.e., calculate the mean value of the objects
for each cluster;
(v) until no change;


Fig. 5.4 The k-means Partitioning Algorithm
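
As a rough illustration of the method in fig. 5.4, the following Python sketch runs k-means on one-dimensional data, using the nine items of Q.29 with initial means 2 and 4. The function names `k_means_1d` and `square_error` are hypothetical. Unlike step (i) of the figure, the initial means are passed in explicitly so that the run is reproducible, and ties are resolved in favour of the lower-numbered cluster.

```python
def k_means_1d(data, initial_means):
    """Cluster 1-D values; returns (means, clusters) once assignments stop changing."""
    means = list(initial_means)
    previous = None
    while True:
        clusters = [[] for _ in means]
        for x in data:
            # min() returns the first minimum, so ties go to the lower-numbered cluster.
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        if clusters == previous:
            return means, clusters
        previous = clusters
        # Update step: an empty cluster keeps its old mean.
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]


def square_error(means, clusters):
    """Square-error criterion E: summed squared distance to each cluster mean."""
    return sum((x - m) ** 2 for m, c in zip(means, clusters) for x in c)


items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means, clusters = k_means_1d(items, initial_means=[2, 4])
error = square_error(means, clusters)
```

The run ends with the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}. Each pass over the data costs O(nk) distance computations, which gives the O(nkt) total for t iterations.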


Let's look closer at k-means clustering. Suppose that there is a set of
objects located in space as shown in fig. 5.5 (a). Let k = 3; that is, the user
would like the objects to be partitioned into three clusters. According to the
algorithm as in fig. 5.4, we arbitrarily choose three objects as the three initial
cluster centers, where cluster centers are marked by a “+”. Each object is
distributed to a cluster based on the cluster center to which it is the nearest.
Such a distribution forms silhouettes encircled by dotted curves as in
fig. 5.5 (a). Next, the cluster centers are updated. Using the new
cluster centers, the objects are redistributed to the clusters based on which
cluster center is nearest. Such a redistribution is encircled by dashed curves as
shown in fig. 5.5 (b). This process iterates leading to fig. 5.5 (c). Eventually
no redistribution of the objects in any cluster occurs and so the process
terminates. The resulting clusters are returned by the clustering process.


(a) (b) (c)
Fig. 5.5 Clustering of a Set of Objects Based on the k-means Method (The
mean of each cluster is marked by a “+”)


The algorithm attempts to determine k partitions that minimize the
square-error function. It works well when the clusters are compact clouds that
are rather well separated from one another. The method is relatively scalable
and efficient in processing large data sets because the computational complexity
of the algorithm is O(nkt), where n is the total number of objects, k is the
number of clusters, and t is the number of iterations.

The k-means method, however, can be applied only when the mean of a
cluster is defined. This may not be the case in some applications, such as when
data with categorical attributes are involved. The necessity for users to specify
k, the number of clusters, in advance can be seen as a disadvantage. The
k-means method is also sensitive to noise and outlier data points, because a
small number of such data can substantially influence the mean value.

Q.32. Explain k-medoids partitioning method.
Or
Write k-medoids clustering algorithm. Also explain with the help of
example. (R.GP.V., Dec. 2008)
Or
Explain k-medoid algorithm. (R.GP.V., June 2011)


Ans. In k-medoid algorithms, each cluster is represented by one of the
objects located near the center of the cluster. The partitioning method is then
performed based on the principle of minimizing the sum of the dissimilarities
between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$$

where E is the sum of the absolute error for all objects in the data set; p is the
point in space representing a given object in cluster C_j; and o_j is the
representative object of C_j. In general, the algorithm iterates until, eventually,
each representative object is actually the medoid, or most centrally located
object, of its cluster. This is the basis of the k-medoids method for grouping n
objects into k clusters.

Let's look closer at k-medoids clustering. The initial representative objects
(or seeds) are chosen arbitrarily. The iterative process of replacing representative
objects by nonrepresentative objects continues as long as the quality of the
resulting clustering is improved. This quality is estimated using a cost function
that measures the average dissimilarity between an object and the representative
object of its cluster. To determine whether a nonrepresentative object, o_random,
is a good replacement for a current representative object, o_j, the following

four cases are examined for each of the nonrepresentative objects, p, as shown
in fig. 5.6.

1. Reassigned to o_i    2. Reassigned to o_random    3. No change    4. Reassigned to o_random
(Legend: data object; + cluster center; solid line before swapping; dashed line after swapping)
Fig. 5.6 Four Cases of the Cost Function for k-medoids Clustering

Case I — p currently belongs to representative object, o_j. If o_j is replaced
by o_random as a representative object and p is closest to one of the other
representative objects, o_i, i ≠ j, then p is reassigned to o_i.

Case II — p currently belongs to representative object, o_j. If o_j is replaced
by o_random as a representative object and p is closest to o_random, then p is
reassigned to o_random.

Case III — p currently belongs to representative object, o_i, i ≠ j. If o_j is
replaced by o_random as a representative object and p is still closest to o_i, then
the assignment does not change.

Case IV — p currently belongs to representative object, o_i, i ≠ j. If o_j is
replaced by o_random as a representative object and p is closest to o_random, then
p is reassigned to o_random.

PAM (Partitioning Around Medoids) was one of the first k-medoids
algorithms. It attempts to determine k partitions for n objects. After an initial
random selection of k representative objects, the algorithm repeatedly tries to
make a better choice of cluster representatives. All of the possible pairs of
objects are analyzed, where one object in each pair is considered a
representative object and the other is not. The quality of the resulting clustering
is calculated for each such combination. An object, o_j, is replaced with the
object causing the greatest reduction in error. The set of best objects for each
cluster in one iteration forms the representative objects for the next iteration.
The final set of representative objects are the respective medoids of the clusters.
The complexity of each iteration is O(k(n − k)²). For large values of n and k,
such computation becomes very costly.

Algorithm — k-medoids. PAM, a k-medoids algorithm for partitioning based
on medoid or central objects.
Input —  k : the number of clusters,
         D : a data set containing n objects.
Output — A set of k clusters.

Method —
(i) arbitrarily choose k objects in D as the initial representative objects or seeds;
(ii) repeat
(iii) assign each remaining object to the cluster with the nearest representative object;
(iv) randomly select a nonrepresentative object, o_random;
(v) compute the total cost, S, of swapping representative object, o_j, with o_random;
(vi) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(vii) until no change;


Fig. 5.7 PAM, a k-medoids Partitioning Algorithm
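
The swap test at the heart of PAM can be sketched in Python for one-dimensional data. This is a simplified sketch under stated assumptions: the names `total_cost` and `pam` are hypothetical, the first k objects serve as the arbitrary seeds, and instead of drawing o_random at random the sketch examines every representative and nonrepresentative pair and applies the swap with the greatest reduction in error, as in the description of PAM above.

```python
def total_cost(data, medoids):
    """Absolute-error criterion E: distance from each object to its nearest medoid."""
    return sum(min(abs(x - m) for m in medoids) for x in data)


def pam(data, k):
    """Apply the best cost-reducing swap repeatedly until no swap helps."""
    medoids = list(data[:k])  # assumption: first k objects act as the arbitrary seeds
    while True:
        best_swap, best_change = None, 0
        for j in range(len(medoids)):
            for o_random in data:
                if o_random in medoids:
                    continue
                candidate = medoids[:j] + [o_random] + medoids[j + 1:]
                # S < 0 means the swap lowers the absolute-error criterion.
                s = total_cost(data, candidate) - total_cost(data, medoids)
                if s < best_change:
                    best_swap, best_change = candidate, s
        if best_swap is None:
            return sorted(medoids)
        medoids = best_swap


items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
medoids = pam(items, k=2)
cost = total_cost(items, medoids)
```

Every candidate swap requires a pass over all objects, which is why the cost per iteration grows so quickly with n and k and why sampling-based variants such as CLARA are used on larger data sets.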


Q.33. Explain the various categories of partitioning algorithms briefly.
(R.GP.V., Dec. 2011)
Ans. Refer to Q.31, Q.32 and Q.30.


Q.34. How can manage clustering in large database ?


Ans. The clustering algorithms described so far may not be appropriate
when clustering is employed with large or dynamic databases. Firstly, they
assume that sufficient main memory is available to hold the data to be clustered
and the data structures required to support them. This assumption is not realistic
with large databases, which contain thousands of items. Performing I/Os
continuously through the various iterations of an algorithm is too expensive.
Due to these main memory restrictions, the algorithms do not scale up to large
databases. Another issue is that some algorithms assume that the data are
present all at once. For dynamic databases, these methods are not appropriate.
Clustering methods should be able to adapt as the database changes.

It has been argued that, to perform effectively on large databases, a
clustering algorithm should —

(i) need no more than one scan of the database.

(ii) have the capability to give status and the “best” answer so far during
the execution of the algorithm. This is sometimes known as the capability to
be online.

(iii) be suspendable, stoppable and resumable.

(iv) be capable of updating the results incrementally as data are
added to or removed from the database.

(v) work with limited main memory.

(vi) be capable of using different techniques for scanning the
database. This may include sampling.

(vii) process each tuple only once.

Current research at Microsoft has tested how to efficiently perform the
clustering algorithms with large databases. The fundamental idea of this scaling
approach is given below —

(i) Read a subset of the database into main memory.

(ii) Perform the clustering method on the data in memory.

(iii) Combine results with those from prior samples.

(iv) Then, the in-memory data are divided into three different types:
those that will be saved in a compressed format, those items that will always
be required even when the next sample is brought in, and those that may be
discarded with suitable updates to data being kept in order to answer the
problem. Based on the type, each data item is then kept, deleted or
compressed in memory.

(v) When termination criteria are not met, then repeat from step (i).

This method has been applied to the k-means algorithm and has been
shown to be effective.


Q.35. What do you mean by BIRCH (balanced iterative reducing and
clustering using hierarchies) ? What is the use of BIRCH for large database ?


Ans. A hierarchical-agglomerative clustering algorithm is known as
BIRCH. Zhang, Livny and Ramakrishnan proposed it. BIRCH is developed to
cluster large datasets of n-dimensional vectors using a limited amount of main
memory. BIRCH proposes a special data structure, the CF tree, to condense
information about subclusters of points. The key idea of the algorithm is given
below —

BIRCH is based on the principles of agglomerative clustering. That is, at
any given stage there are smaller subclusters and the decision at the current
stage is to merge subclusters based on some criteria. This task is handled by
BIRCH in a very novel manner. In place of maintaining all the objects of a
subcluster, BIRCH maintains a set of cluster features (CFs) of the subcluster. The
criteria for merging two subclusters are so defined that the decision to merge two
subclusters can be taken from the information provided solely by the set of
CFs of the respective subclusters. One need not refer to the main data objects
for this task. The cluster features of different subclusters are maintained in a
CF tree. The nicety of the algorithm is that it needs just one pass to create a CF
tree, and the subsequent stages work on this tree rather than the actual database.
The last stage, which the proposers of the theory term as optional, needs one
more database pass.


BIRCH Algorithm — The BIRCH algorithm is as follows —
(i) Start
(ii) for each t_i ∈ D do
        determine correct leaf node for t_i insertion;
(iii)   if threshold condition is not violated, then
            add t_i to cluster and update CF triples;
(iv)    else
            if room to insert t_i, then
                insert t_i as single cluster and update CF triples;
            else
                split leaf node and redistribute CF features;
(v) end 
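
The "CF triples" updated in the steps above can be sketched in a few lines. This is a minimal illustration, assuming the usual BIRCH definition of a clustering feature as the triple (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms. The class name `CF` and its method names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class CF:
    """Clustering feature (N, LS, SS) of one subcluster (illustrative sketch)."""
    n: int        # N: number of points in the subcluster
    ls: list      # LS: per-dimension linear sum of the points
    ss: float     # SS: sum of squared norms of the points

    @classmethod
    def of_point(cls, p):
        # A single point is a subcluster of size one.
        return cls(1, list(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: the CF of a union is the component-wise sum,
        # so merging never needs the original data objects.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # Root-mean-square distance of the points from the centroid,
        # computed from the triple alone.
        c2 = sum(x * x for x in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))


a = CF.of_point((1.0, 2.0))
b = CF.of_point((3.0, 4.0))
merged = a.merge(b)
print(merged.centroid(), merged.radius())
```

A threshold test such as the one in step (iii) could then compare `merged.radius()` against a chosen limit before accepting the insertion; the exact criterion is a design choice not fixed by this sketch.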


Q.36. What do you mean by DBSCAN (density based spatial clustering
of applications with noise) and CURE (clustering using representatives) ?
What is the use of both for large database ?

Ans. DBSCAN - Refer to Q.22 (iii).


DBSCAN Algorithm — DBSCAN algorithm is as follows —

(i) Start
(ii) k = 0; // Initially there are no clusters
(iii) for i = 1 to n do
        if t_i is not in a cluster, then
            X = {t_j | t_j is density reachable from t_i};
            if X is a valid cluster, then
                k = k + 1;
                K_k = X;

(iv) end 
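
The loop above can be sketched as runnable code. This is a simplified illustration, not an optimized implementation: the parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbourhood size for a core point) are the customary DBSCAN parameters and are assumed here, and neighbours are found by a brute-force scan.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, 0 for noise."""
    n = len(points)
    labels = [None] * n              # None means "not visited yet"

    def neighbours(i):
        # Brute-force eps-neighbourhood (includes the point itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    k = 0                            # initially there are no clusters
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:     # not a core point: provisionally noise
            labels[i] = 0
            continue
        k += 1                       # t_i starts a valid cluster
        labels[i] = k
        queue = [j for j in seeds if j != i]
        while queue:                 # gather everything density reachable from t_i
            j = queue.pop()
            if labels[j] == 0:       # earlier noise turns out to be a border point
                labels[j] = k
            if labels[j] is not None:
                continue
            labels[j] = k
            nb = neighbours(j)
            if len(nb) >= min_pts:   # only core points extend the cluster
                queue.extend(nb)
    return labels


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))
```

On this hypothetical data the two tight groups become clusters 1 and 2, and the isolated point is labelled as noise.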


CURE - The basic steps of CURE for large database are as follows —
(i) A sample of the database is obtained.
(ii) Partition the sample into p partitions of size n/p. This is done to
speed up the algorithm because clustering is first performed on each partition.
(iii) Using the hierarchical algorithm, partially cluster the points in
each partition. This provides a first guess at what the clusters should be. For
some constant q, the number of clusters is n/pq.
(iv) Remove outliers. Two different methods are employed to remove
the outliers. The first method removes clusters which grow very slowly. If the
number of clusters is below a threshold, those clusters with only one or two
items are removed. It is possible that close outliers are part of the sample and
would not be recognized by the first outlier elimination method. Very small
clusters toward the end of the clustering phase are removed by the second
method.
(v) Completely cluster all data in the sample using the CURE algorithm.
Here, to confirm processing in main memory, the input includes only the cluster
representatives from the clusters found for each partition during the partial
clustering step (iii).
(vi) To represent each cluster, cluster the whole database on disk
using c points. An item in the database is placed in the cluster which has the
closest representative point to it. To fit into main memory, these sets of
representative points are small enough. So each of the n points must be
compared to ck representative points.
CURE Algorithm —
(i) Start
(ii) T = build (D);
     Q = heapify (D); // Initially build heap with one entry per item
(iii) repeat
        u = min (Q);
        v = u.close;
        delete (Q, u.close);
        w = merge (u, v);
        delete (T, u);
        delete (T, v);
        insert (T, w);
(iv)    for each x ∈ Q do
            x.close = find closest cluster to x;
            if x is closest to w, then
                w.close = x;
        insert (Q, w);
(v) until number of nodes in Q is k;
(vi) end


Q.37. Explain about quality and validity of cluster.

Ans. For supervised classification we have a variety of measures to
evaluate how good our model is —
(i) Accuracy (ii) Precision (iii) Recall.

For cluster analysis, the analogous question is how to evaluate the
goodness of the resulting clusters ? While clusters are in the eye of the
beholder, then why do we want to evaluate them ?
(i) To avoid finding patterns in noise.
(ii) To compare clustering algorithms.
(iii) To compare two sets of clusters.
(iv) To compare two clusters.

Different Aspects of Cluster Validation —
(i) Determining the clustering tendency of a set of data, i.e.,
distinguishing whether non-random structure actually exists in the data.
(ii) Comparing the results of a cluster analysis to externally known
results, e.g., to externally given class labels.
(iii) Evaluating how well the results of a cluster analysis fit the data
without reference to external information.
(a) Use only the data
(iv) Comparing the results of two different sets of cluster analyses
to determine which is better.
(v) Determining the correct number of clusters.

For (ii), (iii) and (iv), we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters.


Measures of Cluster Validity — 
(i) Numerical measures that are applied to judge various aspects
of cluster validity are classified into the following three types.
(a) External Index — Used to measure the extent to which
cluster labels match externally supplied class labels.
(1) Entropy
(b) Internal Index — Used to measure the goodness of a
clustering structure without respect to external information.
(1) Sum of squared error (SSE)
(c) Relative Index — Used to compare two different clusterings
or clusters.
(1) Often an external or internal index is used for this
function, e.g., SSE or entropy.
(ii) Sometimes these are referred to as criteria instead of indices. 
(a) However, sometimes criterion is the general strategy and 
index is the numerical measure that implements the criterion. 
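
The two indices named above, SSE (internal) and entropy (external), can be computed as in the following sketch. The function names and the tiny one-dimensional data set are hypothetical; entropy is taken here as the size-weighted average of the per-cluster class entropies, which is one common convention.

```python
from collections import Counter
from math import log2


def sse(points, labels):
    """Internal index: sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, centroid))
    return total


def entropy(labels, classes):
    """External index: size-weighted average class entropy of the clusters."""
    n = len(labels)
    total = 0.0
    for c in set(labels):
        counts = Counter(k for l, k in zip(labels, classes) if l == c)
        size = sum(counts.values())
        h = -sum((v / size) * log2(v / size) for v in counts.values())
        total += (size / n) * h
    return total


points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]
print(sse(points, labels))                       # lower is tighter
print(entropy(labels, ["a", "a", "b", "b"]))     # 0 when clusters are pure
print(entropy(labels, ["a", "b", "a", "b"]))     # higher when classes are mixed
```

Used as a relative index, either number can be compared across two clusterings of the same data.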


Prob. 1. The heights of players of a school's basketball team are 72",
74", 70", 78", 75" and 70". Find the mean height.
Sol. Given that
The total number of players = 6
The heights of players = 72, 74, 70, 78, 75, 70

The mean of height = Sum of heights of players / Total number of players

= (72 + 74 + 70 + 78 + 75 + 70) / 6 = 439 / 6 ≈ 73.17"


DISTRIBUTED ALGORITHMS SUCH AS APRIORI AND FP GROWTH ALGORITHMS


Q.38. Explain whether association rule mining is supervised or
unsupervised type of learning. (R.G.P.V., May 2019)
Ans. Association rules mining is another key unsupervised data mining
method, after clustering, that finds interesting associations (relationships,
dependencies) in large sets of data items. The items are stored in the form of
transactions that can be generated by an external process, or extracted from
relational databases or data warehouses. Due to good scalability characteristics
of the association rules algorithms and the ever-growing size of the accumulated
data, association rules are an essential data mining tool for extracting knowledge
from data. The discovery of interesting associations provides a source of
information often used by businesses for decision making. Some application
areas of association rules are market-basket data analysis, cross-marketing,
catalog design, loss-leader analysis, clustering, data preprocessing, genomics,
etc. Interesting examples are personalization and recommendation systems
for browsing web pages (such as Amazon's recommendations of related/
associated books) and the analysis of genomic data.

Market-basket analysis, one of the most intuitive applications of
association rules, strives to analyze customer buying patterns by finding
associations between items that customers put into their baskets. For instance,
one can discover that customers buy milk and bread together, and even that
some particular brands of milk are more often bought with certain brands of
bread, e.g., multigrain bread and soy milk. These and other more interesting
(and previously unknown) rules can be used to maximize profits by helping to
design successful marketing campaigns, and by customizing store layout. In
the milk and bread example, the retailer may not offer discounts for both at
the same time, but just for one; the milk can be put at the opposite end
of the store with respect to bread, to increase customer traffic so that customers
may possibly buy more products.
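
As a small illustration of how such an association is quantified, the sketch below computes the two standard measures of a rule, support and confidence, for "milk implies bread". The basket contents are hypothetical data invented for this example, and the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)


baskets = [                      # hypothetical market baskets
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread", "butter"},
]
print(support(baskets, {"milk", "bread"}))
print(confidence(baskets, {"milk"}, {"bread"}))
```

No class labels are involved anywhere in the computation, which is why the method counts as unsupervised.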


Q.39. Write apriori algorithm to find frequent item sets. Also explain
the working of algorithm with the help of suitable examples.
(R.G.P.V., Dec. 2008)
Or
Explain apriori algorithm for association rule mining taking suitable
example. (R.G.P.V., Nov./Dec. 2007)
Or
Discuss the apriori algorithm for association rule mining with the help
of a suitable example. (R.G.P.V., June 2008)
Or
Write the apriori algorithm for discovering frequent item sets for mining
single dimensional boolean association rule and explain it with the help of
an example. (R.G.P.V., June 2009)
Or
Write apriori algorithm. Also demonstrate the working of apriori
algorithm. (R.G.P.V., Dec. 2010)
Or
Describe in detail the apriori algorithm. (R.G.P.V., June 2013)


Ans. It is also called the level-wise algorithm. It was proposed by R.
Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean
association rules. It makes use of the downward closure property. As the name
suggests, the algorithm is a bottom-up search, moving upward level-wise in
the lattice. However, the nicety of the method is that before reading the database
at every level, it prunes many of the sets which are unlikely to be
frequent sets.

The first pass of the algorithm simply counts item occurrences to determine
the frequent 1-itemsets. A subsequent pass, say pass k, consists of two phases.
First, the frequent itemsets L_{k-1} found in the (k-1)th pass are used to generate
the candidate itemsets C_k, using the Apriori candidate generation procedure.
Next, the database is scanned and the support of candidates in C_k is counted.
For fast counting, we need to efficiently determine the candidates in C_k
contained in a given transaction t. The set of candidate itemsets is subjected to
a pruning process to ensure that all the subsets of the candidate sets are already
known to be frequent itemsets. The candidate generation process and the
pruning process are the most important parts of this algorithm.
Candidate Generation (Join Step) — Given L_{k-1}, the set of all frequent
(k-1)-itemsets, we want to generate a superset of the set of all frequent
k-itemsets. The intuition behind the Apriori candidate-generation procedure is
that if an itemset X has minimum support, so do all subsets of X. Let us
assume that the set of frequent 3-itemsets is {1, 2, 3}, {1, 2, 5}, {1, 3, 5},
{2, 3, 5}, {2, 3, 4}. Then, the 4-itemsets that are generated as candidate
itemsets are the supersets of these 3-itemsets and in addition, all the 3-itemset
subsets of any candidate 4-itemset must be already known to be in L_3. The
first part and part of the second part is handled by the Apriori candidate-
generation method. The candidate-generation method is described as follows —

gen_candidate_itemsets with the given L_{k-1}
C_k = ∅
for all itemsets l_1 ∈ L_{k-1} do
    for all itemsets l_2 ∈ L_{k-1} do
        if l_1[1] = l_2[1] ∧ l_1[2] = l_2[2] ∧ ... ∧ l_1[k-2] = l_2[k-2] ∧ l_1[k-1] < l_2[k-1]
        then c = l_1[1], l_1[2], ..., l_1[k-1], l_2[k-1]
        C_k = C_k ∪ {c}

Using this algorithm, C_4 := {{1, 2, 3, 5}, {2, 3, 4, 5}} is obtained from
L_3 := {{1, 2, 3}, {1, 2, 5}, {1, 3, 5}, {2, 3, 5}, {2, 3, 4}}. {1, 2, 3, 5} is
generated from {1, 2, 3} and {1, 2, 5}. Similarly, {2, 3, 4, 5} is generated from
{2, 3, 4} and {2, 3, 5}. No other pair of 3-itemsets satisfies the condition

l_1[1] = l_2[1] ∧ l_1[2] = l_2[2] ∧ ... ∧ l_1[k-2] = l_2[k-2] ∧ l_1[k-1] < l_2[k-1]


Pruning (Prune Step) — The pruning step eliminates the extensions of
(k-1)-itemsets which are not found to be frequent, from being considered for
counting support. For example, from C_4, the itemset {2, 3, 4, 5} is pruned,
since not all of its 3-subsets are in L_3 (for example, {2, 4, 5} is not). The
pruning algorithm is described below —

prune (C_k)
for all c ∈ C_k
    for all (k-1)-subsets d of c do
        if d ∉ L_{k-1}
        then C_k = C_k \ {c}

The Apriori frequent itemset discovery algorithm uses these two functions
(candidate generation and pruning) at every iteration. It moves upward in the lattice
starting from level 1 till level k, where no candidate set remains after pruning.

Apriori Algorithm
Initialize : k := 1, C_1 = all the 1-itemsets;
read the database to count the support of C_1 to determine L_1.
L_1 := {frequent 1-itemsets};
k := 2; // k represents the pass number //
while (L_{k-1} ≠ ∅) do
begin
    C_k := gen_candidate_itemsets with the given L_{k-1}
    prune (C_k)
    for all transactions t ∈ T do
        increment the count of all candidates in C_k that are contained in t;
    L_k := All candidates in C_k with minimum support;
    k := k + 1;
end
Answer := ∪_k L_k;
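
The join step, the prune step and the level-wise loop can be put together in a short runnable sketch. It is an illustration under simplifying assumptions, not a tuned implementation: itemsets are represented as sorted tuples, items must be mutually comparable, support is counted by a plain scan, and the names `gen_candidates`, `prune` and `apriori` are chosen for this example.

```python
from itertools import combinations


def gen_candidates(prev):
    """Join step: merge two sorted (k-1)-itemsets that agree on all but the last item."""
    prev = sorted(prev)
    return {a + (b[-1],) for a in prev for b in prev
            if a[:-1] == b[:-1] and a[-1] < b[-1]}


def prune(cands, prev):
    """Prune step: keep a candidate only if every (k-1)-subset is frequent."""
    prev = set(prev)
    return {c for c in cands
            if all(s in prev for s in combinations(c, len(c) - 1))}


def apriori(transactions, min_count):
    """Return {itemset as sorted tuple: support count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]
    counts = {}
    for t in transactions:                       # pass 1: count single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    answer = dict(level)
    while level:                                 # pass k = 2, 3, ...
        cands = prune(gen_candidates(level), level)
        counts = {c: sum(set(c) <= t for t in transactions) for c in cands}
        level = {c: n for c, n in counts.items() if n >= min_count}
        answer.update(level)
    return answer


# The L_3 used in the text: the join gives two candidates, pruning keeps one.
L3 = {(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5), (2, 3, 4)}
print(sorted(gen_candidates(L3)))
print(sorted(prune(gen_candidates(L3), L3)))
```

Running `apriori` on a list of transactions with a minimum support count returns the union of all the frequent levels, mirroring the final "Answer" line of the pseudocode.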
Example — We illustrate the working of the algorithm below.
k := 1
Read the database to count the support of 1-itemsets (Table 5.1). The
frequent 1-itemsets and their support counts are given below.

Table 5.1

L_1 := {{2} → 6, {3} → 6, {4} → 4, {5} → 8, {6} → 5, {7} → 7, {8} → 4}.
k := 2
In the candidate generation step, we get
C_2 = {{2, 3}, {2, 4}, {2, 5}, {2, 6}, {2, 7}, {2, 8}, {3, 4}, {3, 5}, {3, 6},
{3, 7}, {3, 8}, {4, 5}, {4, 6}, {4, 7}, {4, 8}, {5, 6}, {5, 7}, {5, 8},
{6, 7}, {6, 8}, {7, 8}}
The pruning step does not change C_2.
Read the database to count the support of the itemsets in C_2 to get
L_2 := {{2, 3} → 3, {2, 4} → 3, {3, 5} → 3, {3, 7} → 3, {5, 6} → 3,
{5, 7} → 5, {6, 7} → 3}
k := 3
In the candidate generation step,
using {2, 3} and {2, 4}, we get {2, 3, 4},
using {3, 5} and {3, 7}, we get {3, 5, 7}, and
similarly from {5, 6} and {5, 7}, we get {5, 6, 7}.
C_3 := {{2, 3, 4}, {3, 5, 7}, {5, 6, 7}}

The pruning step prunes {2, 3, 4} as not all subsets of size 2, i.e., {2, 3},
{2, 4}, {3, 4}, are present in L_2. The other two itemsets are retained.
Thus, the pruned C_3 is {{3, 5, 7}, {5, 6, 7}}.
Read the database to count the support of the itemsets in C_3 to get
L_3 := {{3, 5, 7} → 3}
k := 4
Since L_3 contains only one element, C_4 is empty and hence the algorithm
stops, returning the set of frequent sets along with their respective support
values as
L := L_1 ∪ L_2 ∪ L_3
Q.40. Describe the principle of pruning in levelwise algorithm. What
is its importance ? (R.G.P.V., Dec. 2...)

Ans. Refer to Q.39.

Q.41. Define association rule mining and explain how apriori algorithm
works with suitable example. (R.G.P.V., May 2018)

Ans. Refer to Q.10 and Q.39.

Q.42. Describe example of data set for which apriori check would
actually increase the cost. (R.G.P.V., May 2019)

Ans. Refer to Q.39.

Q.43. Describe the candidate generating step of level-wise algorithms.
What is its importance ? (R.G.P.V., Dec. 2011)

Ans. Candidate Generating Step of Level-wise Algorithm — Refer to
Q.39, under the heading of candidate generation.

Importance of Candidate Generation — The candidate set generation is
important in improving the performance of frequent set discovery algorithm.
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Or 
How can we improve the efficiency of Apriori-based mining ?
(R.G.P.V., June 2010, 2014)
Or
... methods to improve the efficiency of Apriori algorithm.
(R.G.P.V., Nov. 2019)

Ans. Many variations of the Apriori algorithm have been proposed that
focus on improving the efficiency of the original algorithm. Several of these
variations are summarized as follows —

(i) Hash-based Technique — A hash-based technique can be used
to reduce the size of the candidate k-itemsets, C_k, for k > 1. For example,
when scanning each transaction in the database to generate the frequent
1-itemsets, L_1, from the candidate 1-itemsets in C_1, we can generate all of the
2-itemsets for each transaction, hash them into the different buckets of a hash
table structure, and increase the corresponding bucket counts. A 2-itemset
whose corresponding bucket count in the hash table is below the support
threshold cannot be frequent and thus should be removed from the candidate
set. Such a hash-based technique may substantially reduce the number of the
candidate k-itemsets examined.

Fig. 5.8 Hash Table, H_2, for Candidate 2-itemsets, created using the hash
function h(x, y) = ((order of x) × 10 + (order of y)) mod 7
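
The bucket counting described above can be sketched as follows, using the hash function quoted with Fig. 5.8. The four transactions and the item names I1 to I5 are hypothetical data for illustration only, and the function names are chosen for this example.

```python
from collections import Counter
from itertools import combinations


def bucket_counts(transactions, order, n_buckets=7):
    """Hash every 2-itemset of every transaction into a bucket and count hits."""
    def h(x, y):
        # h(x, y) = ((order of x) * 10 + (order of y)) mod 7
        return (order[x] * 10 + order[y]) % n_buckets

    buckets = Counter()
    for t in transactions:
        for x, y in combinations(sorted(t, key=order.get), 2):
            buckets[h(x, y)] += 1
    return buckets, h


def filter_candidates(candidates, buckets, h, min_count):
    """Drop candidate 2-itemsets whose bucket count is below the threshold."""
    return [c for c in candidates if buckets[h(*c)] >= min_count]


order = {f"I{i}": i for i in range(1, 6)}         # I1 -> 1, ..., I5 -> 5
data = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}]
buckets, h = bucket_counts(data, order)
all_pairs = list(combinations(sorted(order, key=order.get), 2))
print(filter_candidates(all_pairs, buckets, h, min_count=2))
```

On this data the pair (I4, I5) survives the filter although it never occurs, because it shares a bucket with (I2, I4). The technique therefore only removes candidates that cannot be frequent; the survivors still need their exact support counted.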


(ii) Transaction Reduction — It means reducing the number of
transactions scanned in future iterations. A transaction that does not contain
any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Therefore,
such a transaction can be marked or removed from further consideration because
subsequent scans of the database for j-itemsets, where j > k, will not require it.


(iii) Partitioning — A partitioning technique can be used that requires
just two database scans to mine the frequent itemsets. Fig. 5.9 shows the
partitioning. It consists of two phases. In phase I, the algorithm subdivides the
transactions of D into n nonoverlapping partitions. If the minimum support
threshold for transactions in D is min_sup, then the minimum support count
for a partition is min_sup × the number of transactions in that partition. For
each partition, the frequent itemsets local to that partition are found, and all
local frequent itemsets are combined to form the candidate itemsets. In phase II,
a second scan of D determines the actual support of each candidate and hence
the global frequent itemsets.

Fig. 5.9 Partitioning (Phase I: find the frequent itemsets local to each
partition; combine all local frequent itemsets to form the candidate itemsets)

(iv) Sampling - The basic idea is to pick a random sample S of the
given data D, and then search for frequent itemsets in S instead of D. In this
way, we trade off some degree of accuracy against efficiency. The sample size
of S is such that the search for frequent itemsets in S can be done in main
memory, and so only one scan of the transactions in S is required overall.
Because we are searching for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets. To lessen this
possibility, we use a lower support threshold than minimum support to find
frequent itemsets local to S, denoted L^S. The rest of the database is then used
to compute the actual frequencies of each itemset in L^S. A mechanism is used
to determine whether all of the global frequent itemsets are included in L^S.
If L^S actually contains all of the frequent itemsets in D, then only one scan of
D is required.

(v) Dynamic Itemset Counting - In the dynamic itemset counting
technique, the database is partitioned into blocks marked by start points. In
this variation, new candidate itemsets can be added at any start point, unlike
in Apriori, which determines new candidate itemsets only immediately before
each complete database scan. The technique is dynamic in that it estimates the
support of all of the itemsets that have been counted so far, adding new
candidate itemsets if all of their subsets are estimated to be frequent. The
resulting algorithm requires fewer database scans than Apriori.

Q.45. Write the apriori algorithm for discovering frequent item sets for
mining single-dimensional boolean association rules and discuss various
approaches to improve its efficiency. (R.G.P.V., June 2011)
Or
What is apriori algorithm ? Give few techniques to improve the efficiency
of apriori algorithm. (R.G.P.V., June 2012)

Ans. Apriori Algorithm — Refer to Q.39.
Techniques to Improve the Efficiency — Refer to Q.44.


Q.46. What is the 'Apriori property' ? How is it used by the Apriori
algorithm ? What are the drawbacks of the apriori algorithm ?
(R.G.P.V., June 2015)

Ans. Apriori Property — To improve the efficiency of the level-wise
generation of frequent itemsets, an important property called the apriori
property is used to reduce the search space. Here, all nonempty subsets of a
frequent itemset must also be frequent.
By definition, if an itemset I does not satisfy the minimum support
threshold, min_sup, then I is not frequent; that is, P(I) < min_sup. If an item A
is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur
more frequently than I. Therefore, I ∪ A is not frequent either; that is,
P(I ∪ A) < min_sup.
Apriori Algorithm — Refer to Q.39.
Drawbacks — The main identified drawbacks of the apriori algorithm are
that it generates a huge number of candidate itemsets and that it scans the
database several times.


Q.47. Define a border set. Show that every subset of any item set must
contain either a frequent set or border set. (R.G.P.V., June ...)

Ans. Border Set — An itemset can be viewed as a border set if it is not a
frequent set, but all its proper subsets are frequent sets.

Proof — One can notice that if Y is not a frequent itemset, then Y must
contain a subset that is a border set. Since Y is infrequent, it is possible that Y
is a border set. In that case, the proof is done. Now, suppose Y is not a
border set too. Hence, there exists at least one proper subset of cardinality
|Y| - 1 that is infrequent, say Y'. If Y' is a border set, then the proof is complete.
Now suppose that Y' is not a border set. By this process, we can recursively
construct Y, Y', Y", ....., and so on, each contained in the previous one, such that
none of these is a border set or a frequent set, and the construction
process is over when we get a set which is a border set. Since we are
decreasing the size of the sets by 1 in every step, the construction process ends
in a finite number of steps. In the worst case, the empty itemset is always
considered to be a frequent set.


Q.48. Explain in brief about AprioriTid algorithm.

Ans. Similar to the Apriori algorithm, the AprioriTid algorithm uses the
Apriori-gen function to determine the candidate sets. The most important
difference is that the database is not used for counting support after the first
pass. Instead, the set of candidate itemsets is used for this purpose for k > 1. In
case a transaction does not have any candidate k-itemset, the set of
candidate itemsets would not have any entry for that transaction, which will
eventually decrease the number of transactions in the set containing the
candidate itemsets as compared to the database. As the value of k increases, each
entry will be smaller than the corresponding transaction because the number
of candidates in the transaction will decrease. Apriori performs better than
AprioriTid in the initial passes, but in the later passes AprioriTid has better
performance than Apriori.

At first, the entire database is scanned and C1 is obtained in terms of
itemsets. That is, each entry of C1 has all items along with TID. Large itemsets
with 1-item, L1, are calculated by counting entries of C1. Then, apriori-gen is
used to obtain C2. Entries of C2 corresponding to a transaction T are obtained
by considering members of C2 which are present in T. To perform this task,
C1 is scanned rather than the entire database. Afterwards, L2 is obtained by
counting the support in C2. This process continues until the candidate itemsets
are found to be empty.

The advantage of using this encoding function is that in later passes the
size of the encoding function becomes smaller than the database, thus saving
much reading effort. In Apriori-TID, the candidate itemsets in Ck are stored in
an array indexed by TIDs of the itemsets in Ck. Each Ck is stored in a sequential
structure. In the kth pass, Apriori-TID needs memory space for Lk-1 and Ck
during candidate generation. Memory space is needed for Ck-1 and Ck
in the counting phase. Roughly half of the buffer is filled with candidates at
the time of candidate generation. This allows the relevant portions of both Ck
and Ck-1 to be kept in memory during the counting phase. If Lk does not fit
in the memory, it is recommended to sort Lk.

Now in the example below, another set C' is generated of which each
member has the TID of each transaction and the large itemsets present in this
transaction. This set is used to count the support of each candidate itemset.
The advantage is that the number of entries in C' may be smaller than the
number of transactions in the database, especially in the later passes.

Therefore we say, since Apriori-TID uses Ck rather than the entire database
after the first pass, it is very effective in later passes when Ck becomes smaller.
It was also found that Apriori-TID outperforms Apriori when there are a
smaller number of Ck sets, which can fit in the memory, and the distribution of
the large itemsets has a long tail. That means the distribution of entries in large
itemsets is high at early stage. The distribution becomes smaller immediately
after it reaches the peak and continues for a long time. It is reported that the
performance of Apriori is better than that of Apriori-TID for large data sets.
On the other hand, Apriori-TID outperforms Apriori when the Ck sets are
relatively small (fit in memory). Therefore, a hybrid technique "Apriori-Hybrid"
was also introduced.
[Figure: a database of transactions with TIDs 100 to 400, the Itemset/Support
tables produced in each pass, and the note "Every subset of a frequent itemset
is also frequent".]

Fig. 5.10 Apriori-TID Algorithm Example


Q.49. Describe method for mining frequent itemsets that does not involve
the generation of candidate frequent itemsets. Why it is needed ?
Or
Write an algorithm for discovering itemsets without candidate generation.
(R.G.P.V., June 2010, 2014)
Or
Explain FP growth algorithm for mining association rules in large
databases. (R.G.P.V., June 2009)
Or
Explain FP-growth algorithm with an example. (R.G.P.V., June 2015)
Or
Define FP-growth algorithm. (R.G.P.V., June 2016)
Or
Discuss FP-growth algorithm with an example. (R.G.P.V., Nov. 2019)

Ans. Most of the previous algorithms suffer from the following two
shortcomings —
(i) It may need to generate a huge number of candidate sets. If
there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate
more than 10^7 candidate 2-itemsets. Moreover, if there is a frequent set of size
100, then roughly 10^30 candidate sets are generated in this process.
(ii) It may need to repeatedly scan the database and check a large
set of candidates by pattern matching. It is costly to go over each transaction
in the database to determine the support of the candidate itemsets.

Keeping this in mind, a new class of algorithms has recently been
proposed which avoids the generation of large numbers of candidate sets.
One such method is called the frequent-pattern growth or simply FP-growth,
which adopts a divide-and-conquer strategy. In this method first, it compresses
the database representing frequent items into a frequent-pattern tree or FP-
tree, which retains the itemset association information. It then divides the
compressed database into a set of conditional databases, each associated
with one frequent item or "pattern fragment", and mines each such database
separately.

FP_growth Algorithm — Mine frequent itemsets using an FP-tree by
pattern fragment growth.
Input :
(i) D, a transaction database;
(ii) min_sup, the minimum support count threshold.
Output : The complete set of frequent patterns.
Method :
(i) The FP-tree is constructed in the following steps —
(a) Scan the transaction database D once. Collect F, the set of
frequent items, and their support counts. Sort F in support count descending
order as L, the list of frequent items.
(b) Create the root of an FP-tree, and label it as "null". For each
transaction Trans in D do the following —
Select and sort the frequent items in Trans according to the order of L.
Let the sorted frequent item list in Trans be [p|P], where p is the first element
and P is the remaining list. Call insert_tree ([p|P], T), which is performed as
follows. If T has a child N such that N.item-name = p.item-name, then increment
N's count by 1; else create a new node N, and let its count be 1, its parent link
be linked to T, and its node-link to the nodes with the same item-name, via the
node-link structure. If P is nonempty, call insert_tree (P, N) recursively.
(ii) The FP-tree is mined by calling FP_growth (FP_tree, null), which
is implemented as follows.

procedure FP_growth (Tree, α)
if Tree contains a single path P then
    for each combination (denoted as β) of the nodes in the path P
        generate pattern β ∪ α with support_count = minimum
        support count of nodes in β;
else for each a_i in the header of Tree {
    generate pattern β = a_i ∪ α with support_count = a_i.support_count;
    construct β's conditional pattern base and then β's conditional
    FP_tree Tree_β;
    if Tree_β ≠ ∅ then
        call FP_growth (Tree_β, β); }
Example — Suppose the transaction database given in table 5.2.
Table 5.2
[Table 5.2: transaction database, transactions T1 to T9]

The frequent 1-itemsets discovered after scanning the whole database are
{A : 6}, {B : 6}, {C : 4} and {E : 4}, where the numbers are occurrence counts
for the items. Fig. 5.11 depicts the construction of FP-tree. The first transaction
T1 : {ABC} is mapped to the single branch of the FP-tree in fig. 5.11 (a). The
next three transactions, as shown in fig. 5.11 (b), are added by simply extending
the branch and increasing the corresponding counts of existing item nodes.
However, the transaction T5 contains only one frequent item C and thus a new
branch is created. The node-link for item C is extended in fig. 5.11 (c) for
tracking the information of frequent itemsets containing item C. Specifically,
all the possible frequent itemsets can be obtained by following the node links
for the items. Fig. 5.11 (d) illustrates that transactions T6 and T7 are added.
The final FP-tree is constructed by adding the last two transactions T8 and T9
as shown in fig. 5.11 (e). To discover all the frequent itemsets, each frequent
item along with its conditional FP-tree is mined separately and iteratively.


Fig. 5.11 FP-tree 
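
The tree construction (the insert_tree procedure of Q.49) can be sketched in runnable form. This is a minimal illustration: the class and function names are hypothetical, the four transactions are invented data rather than the contents of Table 5.2, and ties in support count are broken by item name, which is an arbitrary choice.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}      # item -> child Node
        self.link = None        # node-link: next node carrying the same item


def build_fp_tree(transactions, min_count):
    """Return (root, header); header maps item -> first node of its node-link chain."""
    support = Counter(i for t in transactions for i in set(t))
    frequent = sorted(((i, c) for i, c in support.items() if c >= min_count),
                      key=lambda ic: (-ic[1], ic[0]))
    order = {i: rank for rank, (i, _) in enumerate(frequent)}   # the list L
    root, header, tail = Node(None, None), {}, {}
    for t in transactions:
        node = root
        # Keep only frequent items, sorted according to the order of L.
        for item in sorted((i for i in set(t) if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:                    # new node: thread its node-link
                child = Node(item, node)
                node.children[item] = child
                if item in tail:
                    tail[item].link = child
                else:
                    header[item] = child
                tail[item] = child
            child.count += 1                     # shared prefix: bump the count
            node = child
    return root, header


root, header = build_fp_tree(
    [["A", "B", "C"], ["A", "B"], ["C"], ["A", "B", "C", "D"]], min_count=2)
node = header["C"]
while node is not None:                          # follow the node-links for item C
    print(node.item, node.count)
    node = node.link
```

Transactions sharing a prefix reuse the same branch, while a transaction such as the hypothetical ["C"] starts a new branch from the root; the node-link chain then connects both C nodes, as described for fig. 5.11 (c).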


Q.50. Define FP-tree. Discuss the method of computing a FP-tree.
(R.G.P.V., Dec. 2004, June ...)

Ans. FP-tree — A frequent-pattern tree (or FP tree) is a tree structure
consisting of an item-prefix-tree and a frequent-item-header table. The item-
prefix-tree consists of a root node labelled null, and each non-root node consists
of three fields: item name, support count, and node link. The frequent-item-header
table consists of two fields: the first is item name and the second is head of
node link, which points to the first node in the FP-tree carrying the item name.

Method of Computing a FP-tree — Refer to Q.49.


Q.51. Write short note on performance evaluation of algorithms.

Ans. Performance evaluation of a number of implementations of different
association mining algorithms has been carried out. In one study that compared
a number of procedures like apriori, CHARM (an algorithm for enumerating
all closed itemsets) and FP-growth methods using real world data as well as
artificial data, the conclusions were as given below —

(i) As compared to the best implementation of the apriori algorithm,
the FP-growth procedure was generally better.
(ii) As compared to apriori, CHARM was also usually better.
(iii) In some conditions, CHARM was better than the FP-growth procedure.
(iv) As compared to other algorithms, apriori was usually better when
the support required was high, because high support leads to a smaller number
of frequent items, which suits the apriori algorithm.
(v) At very low support, the number of frequent items becomes large
and none of the algorithms was able to manage large frequent sets gracefully.

A more comprehensive annual evaluation of software which
implements association rules mining algorithms was started in 2003. The first
workshop on frequent itemset mining implementations was
held in November 2003 and the second was held in November 2004. Many new and
surprising insights into association rule mining have been given by these
workshops. It was found that two algorithms were perhaps the best in the
2003 evaluation of programs. These were as given below —

(i) An efficient implementation of the FP-tree algorithm.
(ii) An algorithm which joins a number of algorithms using various
heuristics.

Algorithms for closed itemset mining as well as for maximal itemset mining
are included by the performance evaluation. The performance evaluation in
2004 found an implementation of an algorithm which includes a novel tree
traversal as the most efficient algorithm for finding frequent, frequent closed
and maximal frequent itemsets.


NUMERICAL PROBLEMS 


Prob.2. Explain the algorithm for mining frequent itemsets without
candidate generation for the given dataset. Minimum support value is 2.

Items Bought
(a, c, d, f, g, i, m, p)
(a, b, c, f, l, m, o)
(b, f, n, j, o, w)
(b, c, k, s, p)
(a, f, c, e, l, p, m, n)

(R.G.P.V., June 2016)


Sol.

TID    Items Bought                 Frequent Itemsets
100    (a, c, d, f, g, i, m, p)     (c, f, a, m, p)
       (a, b, c, f, l, m, o)        (c, f, a, b, m, o, l)
       (b, f, n, j, o, w)           (f, b, o, n)
       (b, c, k, s, p)              (c, b, p)
       (a, f, c, e, l, p, m, n)     (c, f, a, l, p, m, n)
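The first step of mining without candidate generation (FP-growth) can be sketched in code: scan the transactions once to count item supports, drop items below the minimum support, and reorder each transaction by descending support. The sketch below is only an illustration for the dataset of Prob.2. The function names are hypothetical, and ties between items of equal support are broken alphabetically, which is an assumption, so the order among equal-support items may differ from the table above. Any consistent order gives a valid FP-tree.

```python
from collections import Counter

# Transactions of Prob.2, in the order listed in the problem.
transactions = [
    ["a", "c", "d", "f", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "n", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
MIN_SUPPORT = 2


def frequent_item_counts(transactions, min_support):
    """First scan: count each item once per transaction, keep frequent items."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: n for item, n in counts.items() if n >= min_support}


def order_transaction(transaction, counts):
    """Keep only frequent items, sorted by descending support.

    Ties are broken alphabetically (an assumption made for determinism).
    """
    kept = [item for item in set(transaction) if item in counts]
    return sorted(kept, key=lambda item: (-counts[item], item))


counts = frequent_item_counts(transactions, MIN_SUPPORT)
ordered = [order_transaction(t, counts) for t in transactions]

for row in ordered:
    print(row)
```

The ordered lists are what would be inserted, one transaction at a time, into the FP-tree in the second scan.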




{A, C, D}
{B, C, E}
{A, B, C, E}
{B, E}

(R.G.P.V., June 20..)

Items Bought     Frequent Items
{A, C, D}        {C}
{B, C, E}        {B, C, E}
{A, B, C, E}     {B, C, E}
{B, E}           {B, E}


B.E. (Eighth Semester) EXAMINATION, June 2011
(Information Technology Engg. Branch)
DATA MINING AND WAREHOUSING
[IT-840(N)]

Note : Attempt one question from each Unit. All questions carry equal marks.

Unit-I

1. (a) Differentiate between operational database and data warehouse. 10
(See Unit II, Page 85, Q.19)
(b) Discuss in detail about data transformation. 10
(See Unit I, Page 30, Q.31)
2. (a) Explain about concept of hierarchy generation for categorical data. 10
(See Unit I, Page 43, Q.48)
(b) Briefly describe 3-tier data warehouse architecture. 10
(See Unit I, Page 11, Q.6)

Unit-II

3. (a) What is PDA ? How do people use them ? What do you think will
happen with these products over the next few years ? ** 10
(b) Explain the following : 10
(i) Hypercube (ii) Metadata warehouse (iii) MDDB.
(See Unit II, Page 84, Q.17)
4. (a) Explain multidimensional data models briefly. 10
(See Unit I, Page 64, Q.74)
(b) Briefly discuss types of OLAP and also explain the processing of
OLAP queries. (See Unit II, Page 79, Q.6) 10

Unit-III

5. (a) What is data mining ? Briefly describe the components of a data
mining system. What kinds of patterns can be identified in a data
mining system ? (See Unit III, Page 110, Q.11) 12
(b) "Text mining is different from conventional data mining". Comment. ** 8
6. (a) What are the different ways of interfacing a data mining system with
database systems ? (See Unit III, Page 123, Q.26) 10
(b) Explain KDD briefly. Distinguish between KDD and data mining.
(See Unit III, Page 125, Q.29) 10

Unit-IV


7. (a) Discuss the concept of frequent sets ... (See Unit V, ...)
(b) Define a FP-tree. Discuss the method of computing ... (See Unit V, ...)
8. (a) Define a border set. Show that every subset of an ... is
either a frequent set or border set. (See Unit V, ...)
(b) Write the apriori algorithm for discovering frequent item sets for mining
single-dimensional Boolean association rules and discuss
approaches to improve its efficiency. (See Unit V, Page 217, Q.45) 10

Unit-V

9. (a) Discuss the importance of similarity metric in clustering. Why is it
difficult to handle categorical data for clustering ? (See Unit V, ...) 10
(b) Explain k-means algorithm briefly. (See Unit V, ...)
10. (a) Explain k-medoid algorithm. (See Unit V, Page 201, Q.31) 10
(b) Compare and discuss the various approaches to clustering.
(See Unit V, Page 195, ...)


B.E. (Eighth Semester) EXAMINATION, Dec. ...
(Information Technology Engg. Branch)
DATA MINING AND WAREHOUSING
(Elective-IV)
(IT-840)

Note : Attempt one question from each Unit. All questions carry equal marks.

1. (a) Explain the use of facts, dimensions and attributes in the star schema.
(See Unit I, Page ...)
(b) Give three examples of problems likely to be encountered when
operational data are integrated into the data warehouse. 10
(See Unit I, Page 15, Q.13)
2. (a) Explain snowflake and galaxy-schemas for multidimensional
databases. (See Unit I, Page 50, Q.58) 10
(b) Discuss the most common performance improvement techniques
used in star schemas. (See Unit I, Page 47, Q.53) 10

Unit-II

3. (a) What is OLAP and what are its main characteristics ? 5
(See Unit II, Page 75, Q.1)
(b) ... in the relational database environment ? 5
(See Unit II, Page 82, Q.15)
(c) Explain multidimensional cubes and describe how the slice and dice
technique fits into this model. (See Unit II, Page 90, Q.23) 10
4. Briefly discuss the following : 20
(a) The advantages of ROLAP and MOLAP (See Unit II, Page 82, Q.13)
(b) Reasons for data partitioning
(c) Discretization (d) Ice-berg query.
(See Unit I, Page 41, Q.46)


Unit-III

5. (a) Explain the four key considerations in any data mining programming.
What is lift and why does it matter ? 10
(b) Discuss the application of data mining in the banking industry.
(See Unit III, Page 113, Q.16) 10
6. (a) Discuss the issues and challenges in data mining. 10
(See Unit III, Page 129, Q.34)
(b) Explain web usage mining and spatial mining briefly. ** 10

Unit-IV

7. (a) Describe the candidate generating step of level-wise algorithms. What
is its importance ? (See Unit V, Page 214, Q.43) 10
(b) Explain multi-level association rule. (See Unit V, Page 189, Q.15) 10
8. (a) Discuss the importance of discovering association rules. 10
(See Unit V, Page 188, Q.13)
(b) Explain time series mining association rules. ** 10

Unit-V

9. (a) Why is Naive Bayesian classification called Naive ? Briefly outline
the major ideas of Naive Bayesian classification. 10
(See Unit IV, Page 144, Q.12)
(b) Discuss about model based clustering methods. 10
(See Unit V, Page 198, Q.24)
10. (a) What is cluster analysis ? What are some typical applications of
clustering ? What are some typical requirements of clustering in data
mining ? (See Unit V, Page 183, Q.5) 10
(b) Explain the various categories of partitioning algorithms briefly. 10
(See Unit V, Page 205, Q.33)

**Now, according to new revised syllabus of R.G.P.V., it is not included in syllabus


Note : Attempt any five questions. All questions carry equal marks.

1. (a) List out the advanced database systems and also write the applications of
advanced database system with example. (See Unit I, Page 16, ...)
(b) Explain the concept of data warehouse and data mart with an example.
(See Unit I, Page 19, Q.16)
2. (a) Consider the case of a bank credit card system :
(i) Give 3D data elements and 2 fact elements ...
database for this data warehouse. Draw ...
(ii) Suggest an aggregate that would apply ...
(b) Compare and contrast the two basic ...
warehouse.
3. (a) Write the classification of data mining ... (See Unit III, Page 113, Q.17)
(i) The Models (ii) Types of data (iii) ...
(iv) Level of abstraction of knowledge
(b) Explain association rule in mathematical ...
and confidence in association rule mining.
4. (a) What is apriori algorithm ? Give few ...
efficiency of apriori algorithm. (See Unit V, Page 217, Q.45) 10
(b) What are multidimensional association rules ? Mention few approaches
to mining multilevel association rule. (See Unit V, Page 191, Q.17) 10
5. (a) Explain about the following :
(i) Algorithm for generating decision tree
(ii) Pre-pruning and post-pruning approach.
(See Unit IV, Page 162, Q.32)
(b) Define Clustering. What are the requirements of cluster analysis ? 8
(See Unit V, Page 183, Q.4)
6. (a) What is the use of Regression ? What are the two approaches used
by regression to perform classification ? ** 10
(b) ... time series analysis ? What are the various detected
... time series ? ** 10
7. (a) ... descriptive and predictive model in data mining.
(See Unit III, Page 117, Q.20)
(b) What is data warehouse ? Define ROLAP, MOLAP and HOLAP.
(See Unit II, Page 81, Q.11) 12
8. ... five criteria for the evaluation of classification and
prediction.
B.E. (Eighth Semester) EXAMINATION, June 2013 
(Information Technology Engg. Branch) 
DATA MINING AND WAREHOUSING 
(IT-840) 


Note : Attempt all questions. All questions carry equal marks.

1. (a) Describe the overall and typical architecture of data warehouse. 10
(See Unit I, Page 11, Q.6)
(b) Write in detail about the key components of data warehouse. 10
(See Unit I, Page 5, Q.3)
Or
2. (a) Explain all steps in designing star schema. (See Unit I, Page 47, Q.52) 10
(b) Discuss the features of the multidimensional model. 10
(See Unit I, Page 66, Q.76)
3. (a) ... architecture of MOLAP in detail with the help of suitable
... (See Unit II, Page 83, Q.16) 10
(b) What do you understand by the term OLAP ? ...
operations which are supported by the tools.
(See Unit II, Page 90, Q.22)
Or
4. (a) Describe the query management process. (See Unit I, Page ...) 10
(b) ... OLAP software architecture with the help of schematic
diagram.
5. Write short notes on :
(a) ...
(b) Clustering (See Unit V, Page 181, Q.1)
(c) Regression
(d) Generalization
(e) Sequence discovery
Or
6. What are the differences between :
(a) Data mining versus query tools
(b) Data mining and machine learning
(c) Relational DBS and spatial DBS
(d) Data mining versus knowledge discovery.
(See Unit III, Page 126, Q.31)
7. (a) Discuss the detailed concept of mining multidimensional association
rules from relational databases. (See Unit V, Page 191, Q.17) 10
(b) Describe in detail the apriori algorithm. (See Unit V, Page ..., Q.39) 10
Or
8. (a) Explain in brief that how we can mine different kind of association
rules. (See Unit V, Page 194, ...)
(b) Give brief note on latest trends in association rule mining. ** 10
9. (a) What is decision tree ? Explain how classification is done using decision
tree induction. (See Unit IV, Page ..., Q.19) 10
(b) Given data items {2, 4, 10, 12, 3, 20, 30, 11, 25}, trace the working
of K-means clustering algorithm for the same. 10
(See Unit V, Page 200, Q.29)
Or
10. (a) Discuss the important issues to be addressed by data clustering
system. **
(b) Write short notes : 10
(i) Bayesian classification
(ii) Associative rule based classification
(See Unit IV, Page 145, Q.13)


B.E. (Eighth Semester) EXAMINATION, June 2014 
(Information Technology Engg. Branch) 
DATA MINING AND WAREHOUSING
(IT-840)

Note : Attempt any one question from each unit. All questions carry equal marks.

UNIT-I
1. (a) What is data warehouse ? How is a data warehouse different from a
database ? (See Unit II, Page 87, Q.20)
(b) Differentiate between star-snowflake schemas with the help of
examples. (See Unit I, Page 49, Q.55)
Or
2. (a) What is data warehouse ? Discuss a three tier data warehouse.
(See Unit I, Page 13, Q.7)
(b) How is data warehouse different from a database ? How are they
similar ? (See Unit II, Page 85, Q.19)

UNIT-II
3. (a) Discuss various types of OLAP servers. How are the data actually
stored in different server architectures ? (See Unit II, Page 80, Q.10)
(b) Briefly compare the following concept.
(i) ROLAP versus MOLAP versus HOLAP servers.
(See Unit II, Page 80, Q.8)
(ii) Roll-up, Drill-down, Slice and Dice OLAP operations.
(See Unit II, Page 87, Q.21)
Or
4. (a) What is meant by data warehouse schemas ? Draw schematic
diagrams of its various term. (See Unit I, Page 50, Q.57)
(b) Describe the following term
(i) Data cube (See Unit I, Page 66, Q.75)
(ii) Data warehouse architecture. (See Unit I, Page 11, Q.6)

UNIT-III
5. (a) What is meant by data transformation ? (See Unit I, Page 30, Q.31)
(b) Describe the issue to be considered during data integration.
(See Unit I, Page 29, Q.29)
Or
6. (a) What do you understand by dimensionality reduction ? Discuss any
two methods of dimensionality reduction.
(See Unit I, Page 35, Q.38)
(b) Why preprocessing of data is required ? What are the various forms
of data preprocessing ? (See Unit I, Page 24, Q.24)

UNIT-IV
7. (a) How can we improve the efficiency of apriori based mining ?
(See Unit V, Page 215, Q.44)
(b) Describe the principle of pruning in levelwise algorithms. What is its
importance ? (See Unit V, Page 214, Q.40)
Or
8. (a) Write an algorithm for discovering itemsets without candidate
generation. (See Unit V, Page 219, Q.49)
(b) Discuss mining of multilevel association rules ...
redundant multilevel association rules ...

UNIT-V
9. (a) Write an algorithm for decision tree induction ...
characteristics of decision tree induction ...
(b) What are the different categories of ... (See Unit V, Page 195, ...)
Or
10. Write short note on any four of the following :
(a) Prediction Analysis
(b) Cluster projection (See Unit V, Page 199, Q.28)
(c) K-cluster (See Unit V, Page 199, Q.28)
(d) Intra-attribute summary (See Unit V, Page 199, Q.28)
(e) Partitioning methods. (See Unit V, Page ..., Q.23)


B.E. (Eighth Semester) EXAMINATION, June 2015
(Information Technology Engg. Branch)
DATA MINING AND WAREHOUSING

Note : Attempt all questions. All questions carry equal marks.

1. (a) How is a data warehouse different from a database ? How are they
similar to each other ? (See Unit II, Page 85, Q.19) 7
(b) Describe three data warehouse models : the enterprise warehouse,
the data mart and the virtual warehouse. (See Unit I, Page 18, Q.15) 7
Or
2. (a) Discuss system development life cycle of a data warehouse. What
factors should be considered while designing a data warehouse ?
(See Unit I, Page 13, Q.10)
(b) Describe star schema and snowflake schema with examples. 7
(See Unit I, Page 49, Q.55)
3. (a) Why most data warehouse system support index structures ? Discuss
methods to index OLAP data. (See Unit I, Page 56, Q.64) 7
(b) Discuss typical OLAP operations in brief. (See Unit II, Page 87, Q.21) 7
Or
4. (a) Define OLAP. What are the four different types of OLAP server
from implementation point of view ? Explain briefly. 7
(See Unit II, Page 80, Q.9)
5. (a) How the different types of models apply to the ... ?
(b) Discuss issues in data mining. (See Unit III, Page 129, Q.34) 7
Or
6. (a) What do you mean by data reduction ? What are the strategies of the
data reduction ? (See Unit I, Page 31, Q.33) 7
(b) Explain briefly
(i) Text mining **
(ii) Web usage mining **
(iii) Spatial mining **
(iv) Web structure mining **
7. (a) What is the 'apriori property' ? How is it used by the apriori algorithm ?
What are the drawbacks of the apriori algorithm ? 7
(See Unit V, Page 217, Q.46)
(b) What do you mean by association rule mining ? Give an example of
market basket analysis from the real world. (See Unit V, Page 186, Q.10) 7
Or
8. (a) Explain FP-growth algorithm with an example. 7
(See Unit V, Page 219, Q.49)
(b) Discuss the latest trends in association rule mining. ** 7
9. (a) Suppose we have the following points :
(1, 1), (2, 4), (3, 4), (5, 8), (6, 2), (7, 8). Use k-means algorithm (k = 2)
to find two clusters. The distance function is Euclidean distance. Find
2 clusters using k-means clustering algorithm. Use (1, 1) and (2, 4) to
form the initial clusters. 7
(b) What are the requirements of clustering in data mining ? 7
(See Unit V, Page 182, Q.3)
Or
10. (a) Why is decision tree induction popular ? Discuss over fitting of an
induced tree and two approaches to avoid over-fitting using suitable
examples/diagrams. (See Unit IV, Page 155, Q.26) 7
(b) What is hierarchical clustering ? Differentiate agglomerative and
divisive hierarchical clustering. (See Unit V, Page 198, Q.26) 7


Note :
(i) Answer five questions. In each question part A, B, C, is
compulsory and D part has internal choice.
(ii) All parts of each question are to be attempted at one place.
(iii) All questions carry equal marks, out of which part A and B
(Max. 50 words) carry 2 marks, part C (Max. 100 words) carry
3 marks, part D (Max. 400 words) carry 7 marks.
(iv) Except numericals, Derivation, Design and Drawing etc.

Unit-I
1. (a) What is data mart ? What are the types of data mart ?
(See Unit I, Page 57, Q.65)
(b) What are steps involved in clean and transformation of data ?
(See Unit I, Page ...)
(c) List the contents of dimension of table.
(d) Draw the data warehouse architecture and explain its components.
(See Unit I, Page ...)
Or
Give reason, why it is necessary to separate data warehouse from
operational database. (See Unit I, Page 15, Q.13)

Unit-II
2. (a) What is the difference between OLTP and data warehouse ?
(See Unit II, Page 85, ...)
(b) List the difference between OLAP and OLTP.
(See Unit II, Page 85, ...)
(c) How can the data warehouse data be accessed efficiently ?
(d) Discuss the methods for efficient computation of data cubes.
(See Unit I, Page 54, ...)
Or
Write short notes on :
(i) ROLAP
(ii) MOLAP.
(See Unit II, Page 80, Q.7)

Unit-III
3. (a) List any four data mining application. (See Unit III, Page 113, Q.15)
(b) Write the difference between database and knowledge base.
(See Unit III, Page 110, Q.10)
(c) What are goals of web usage mining ? **
(d) Explain in detail about text mining applications. **
Or
What is data mining functionality ? Explain different types of data
mining functionality with examples. (See Unit III, Page 118, Q.22)

Unit-IV
4. (a) Define support and confidence in association rule mining.
(See Unit V, Page 188, Q.12)
(b) What are the latest trends in association rules mining ? **
(c) Define FP-growth algorithm. (See Unit V, Page 219, Q.49)
(d) Explain the algorithm for mining frequent item sets without candidate
generation for the given dataset. Minimum support value is 2.

Items bought
(a, c, d, f, g, i, m, p)
(a, b, c, f, l, m, o)
(b, f, n, j, o, w)
(b, c, k, s, p)
(a, f, c, e, l, p, m, n)

(See Unit V, Page 223, Prob.2)
Or
Describe the algorithm for time series mining association rules.

Unit-V
5. (a) What is meant by outlier ? **
(b) How is the zero frequency problem handled in naive bayes classifier ?
(See Unit IV, Page 145, Q.15)
(c) List out difference between clustering and classification.
(See Unit V, Page 183, Q.6)
(d) What is clustering ? Briefly describe the partitioning and hierarchical
clustering methods. Give examples in each case.
(See Unit V, Page 198, Q.25)
Or
Discuss in detail about the Bayesian and decision tree classifier.
(See Unit IV, Page 149, Q.20)


IT-840
B.E. (VII Semester) Examination, June 2017
Data Mining & Warehousing
(Elective-IV)

Note : Total number of questions 07. Attempt any five questions (including
all parts). ...


1. (a) What are the differences between the three ...
warehouse usage - information processing, ...
and data mining ? (See Unit I, Page ...)
(b) Briefly describe the similarities and differences between ...
snowflake schemas. (See Unit I, Page 47, Q.54) 7
2. (a) Illustrate how each of the following functions may be ...
in MOLAP. 7
(i) The generation of a data warehouse
(ii) Roll-up
(iii) Drill-down.
(See Unit II, Page 90, Q.24)
(b) Explain in detail about data discretization and concept hierarchy
generation. (See Unit I, Page ..., Q.45) 7
3. (a) Describe challenges to data mining regarding data mining
methodology and user interaction issues.
(See Unit III, Page 129, Q.34)
(b) Explain in detail about data cube aggregation and attribute subset
selection. (See Unit I, Page 34, Q.36) 7
4. (a) Apply Apriori algorithm for the following transactions to find all
frequent itemsets with min_sup = 3. 7
...
(b) ... dimensionality reduction
techniques. (See Unit I, Page 36, Q.39) 7
5. (a) What are the advantages of decision tree induction ?
(See Unit IV, Page 149, ...)
(b) What are the two approaches to tree pruning ?
(See Unit IV, Page 162, Q.31)
6. (a) Define outlier and challenges of outlier detection ? Evaluate what
information is used by outlier detection method ? ** 7
(b) Evaluate the need for data preprocessing by taking example of an
application of your choice. (See Unit I, Page 22, Q.22) 7
What are the features of Bayesian classification explain in detail ? 7
7. Write short notes on any two of the following : 7 each
(a) Data mart (See Unit I, Page 57, Q.65)
(b) Confusion matrix **
(c) Metadata repository. (See Unit I, Page 63, Q.73)


IT-840
B.E. (VIII Semester) EXAMINATION, May 2018
Grading System (GS)
DATA MINING & WAREHOUSING
(ELECTIVE-IV)

Note :
(i) Attempt any five questions.
(ii) All questions carry equal marks.

1. (a) What is data warehouse ? Explain the data warehouse architecture
with diagram. (See Unit I, Page 13, Q.7)
(b) Discuss star, snowflake and galaxy schema for multidimensional
database. (See Unit I, Page 50, Q.57)
2. (a) Explain OLAP operation in multidimensional data model.
(See Unit II, Page 87, Q.21)
(b) Explain the different tools required to manage a data warehouse.
(See Unit I, Page 19, Q.17)
3. (a) Describe the various techniques for data transformation.
(See Unit I, Page 30, Q.30)
(b) Define data reduction. Explain different technique for data reduction.
(See Unit I, Page 31, Q.33)
4. (a) Briefly explain mining of spatial database and text database. **
(b) Explain what the major classifications of data mining system are.
(See Unit III, Page 121, Q.23)
5. Define association rule mining and explain how apriori algorithm works
with suitable example. (See Unit V, Page 214, Q.41)
6. (a) Explain the major classification of clustering methods.
(See Unit V, Page 195, Q.22)
(b) Discuss Bayesian classification. (See Unit IV, Page 145, Q.13)
7. (a) Discuss about prediction method using Cluster analysis with an
example. **
(b) Explain outlier analysis with examples. **
8. Discuss about classification method using decision tree induction with
example. (See Unit IV, Page 148, Q.19)


CS-8003 (2) (CBGS)
B.E. VII Semester
EXAMINATION, May 2019
Choice Based Grading System (CBGS)
DATA MINING

Note :
(i) Attempt any five questions.
(ii) All questions carry equal marks.

1. (a) With a neat sketch explain the architecture of a data warehouse. 7
(See Unit I, Page 11, Q.6)
(b) Explain the design and construction of a data warehouse.
(See Unit I, Page 44, Q.49)
2. (a) List out the differences between OLTP and OLAP. 7
(See Unit II, Page 85, Q.19)
(b) Discuss the various schematic representations in multi-dimensional
model. (See Unit I, Page 50, Q.57) 7
3. (a) Explain mining multi-dimensional Boolean association rules from
transaction. (See Unit V, Page 192, Q.18) 7
(b) Is the data warehouse a prerequisite for data mining ? Does the data
warehouse help data mining ? If so in what ways ? 7
(See Unit III, Page 122, Q.25)
4. (a) Explain whether association rule mining is supervised or
unsupervised type of learning. (See Unit V, Page 210, Q.38) 7
(b) The heights of players of a school's basket ball team are ..., 14”,
70”, 78”, 75”, and 70”. Find the mean height. 7
(See Unit V, Page 210, Prob.1)
5. (a) Explain the algorithm for constructing a decision tree from training
samples. (See Unit IV, Page 149, Q.23) 7
(b) Explain the methods for computing best split. 7
(See Unit IV, Page 155, Q.27)
6. (a) Explain different data types used in clustering. 7
(See Unit V, Page 185, Q.9)
(b) Explain briefly the differences between “classification” and
“clustering” and give an informal example of an application that
would benefit from each technique. (See Unit V, Page 185, Q.8) 7
7. (a) Describe example of data set for which apriori check would actually
increase the cost. (See Unit V, Page 214, Q.42) 7
(b) Discuss the typical OLAP operations with an example. 7
(See Unit II, Page 87, Q.21)
8. (a) Describe different data cleaning approaches. 7
(See Unit I, Page 28, Q.28)
(b) Can you briefly describe the four stages of knowledge discovery
(KDD) ? Can you describe the multi-tiered data warehouse
architecture ? (See Unit III, Page 128, Q.33) 7


CS-8003 (2) (CBGS)
B.E. VIII Semester
EXAMINATION, November 2019
Choice Based Grading System (CBGS)
DATA MINING

Note :
(i) Attempt any five questions.
(ii) All questions carry equal marks.

1. Briefly explain the Data Warehouse Models with an example. 14
(See Unit I, Page 18, Q.15)
2. (a) Describe OLAP operations in Multidimensional Data Model. 8
(See Unit II, Page 88, Q.21)
(b) Write the difference between OLAP and OLTP. 6
(See Unit II, Page 85, Q.19)
3. What is the need of Data Preprocessing ? Discuss various forms of
preprocessing. (See Unit I, Page 24, Q.24) 14
4. (a) Discuss FP-growth algorithm with an example. 7
(See Unit V, Page 219, Q.49)
(b) Explain the methods to improve the efficiency of Apriori
Algorithm. (See Unit V, Page 215, Q.44) 7
5. (a) Discuss the concept of Naive Bayes Method for classifying data
tuples. (See Unit IV, Page 143, Q.11) 8
(b) Discuss categorization of clustering methods. 6
(See Unit V, Page 195, Q.22)
6. Write a short note on following :
(a) Supervised and Unsupervised Learning with examples
(See Unit IV, Page 135, ...)
(b) OLAP Servers (See Unit II, Page 79, ...)
(c) Multilevel Association Rules. (See Unit V, Page 189, ...)
7. Briefly explain major issues and challenges of Data Mining.
(See Unit III, Page 129, ...)
8. Write a short note on following :
(a) Data Mining Functionalities (See Unit III, Page 118, ...)
(b) Data Warehouse Architecture. (See Unit I, ...)


a 


found patterns. On the basis of the data mining implementation method used,
the pattern evaluation module may be integrated with the mining module. It is
better for fast data mining to push the evaluation of pattern interestingness as
deep as possible into the mining process. As a result, the search is confined
to the interesting patterns.

(vi) User Interface - This module is used for communication between
users and the data mining system. It permits the user to communicate with the
system by specifying a data mining query or task and offers information to
help focus the search. It also performs exploratory data mining on the basis of
the intermediate data mining results. Besides, this component evaluates mined
patterns, visualizes the patterns in various forms and permits the user to browse
database and warehouse schemas or data structures.

Q.10. Write the difference between database and knowledge base.
(R.G.P.V., June ...)

Ans. Refer to Q.9.

Q.11. What is data mining ? Briefly describe the components of a data
mining system. What kinds of patterns can be identified in a data mining
system ? (R.G.P.V., June 2011)

Ans. Data Mining - Refer to Q.1.

Components of a Data Mining System - Refer to Q.9.

Data mining functionalities include the discovery of concept/class
descriptions, associations and correlations, classification, prediction, clustering,
trend analysis, outlier and deviation analysis.

Q.12. What are the benefits of data mining ?

Ans. The types of benefits actually realizable in real-world situations are
as follows :

(i) A supermarket chain enhances earnings by rearranging the
shelves depending on the discovery of affinities of products that sell together.

(ii) ... department ...

(iii) A major bank prevents loss by detecting early warning signs of
attrition in its checking account business.

(iv) A ... company enhances direct mail promotions ...

(v) A catalog sales company doubles its holiday sales from the
previous year by predicting which customers would use the holiday catalog.

(vi) A ... saves huge amount of
money by detecting fraudulent claims.

(vii) An airlines company increases sales to business travelers by
... traveling patterns of frequent flyers.

(viii) A manufacturer of diesel engines increases sales by forecasting
... on the basis of patterns found from historical data of truck ...

(ix) A major banking corporation with investment and financial
... increases the leverage of direct marketing campaign. Predictive
modelling algorithms uncover clusters of customers with high lifetime values.

(x) A company manufacturing ... goods, the shipping
... orders and ... bills. Data mining detects criminal behaviour
by uncovering patterns of orders and premature inventory reductions.

Q.13. What are the limitations of data mining ? (R.G.P.V., Dec. 2010)

Ans. Although data mining products can be very powerful tools, they are not
self-sufficient applications. To be successful, data mining needs skilled technical
and analytical specialists who can structure the analysis and interpret the output
that is created. Consequently, the limitations of data mining are primarily data
or personnel related, instead of technology-related.

Although data mining can help disclose patterns and relationships, it does
not tell the user the value or significance of these patterns. These types of
determinations must be done by the user. Similarly, the validity of the patterns
discovered is dependent on how they compare to “real world” situations. For
example, to assess the validity of a data mining application designed to identify
potential terrorist suspects in a large pool of individuals, the user may test the
model using data that incorporates information about known terrorists.
However, while possibly re-affirming a particular profile, it does not necessarily
mean that the application will identify a suspect whose behaviour significantly
deviates from the original model.

Another limitation of data mining is that while it can identify connections
between behaviours and/or variables, it does not necessarily identify a causal
relationship. For example, an application may identify that a pattern of
behaviour, such as the propensity to purchase airline tickets just shortly before
the flight is scheduled to depart, is related to characteristics such as income,
level of education, and Internet use. However, that does not necessarily specify
that the ticket purchasing behaviour is caused by one or more of these variables.
In fact, the individual's behaviour could be affected by some additional
variable(s) such as occupation (the need to make trips on short notice), family
status (a sick relative needing care), or a hobby (taking advantage of last minute
discounts to visit new destinations).


